neural-networks-and-deep-learning-notes
UNIT - II: Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets,
Maxnet, Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector
Quantization, Counter Propagation Networks, Adaptive Resonance Theory Networks. Special
Networks Introduction to various networks.
UNIT - III : Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed -
forward networks, Gradient-Based learning, Hidden Units, Architecture Design, Back-
Propagation and Other Differentiation Algorithms .
UNIT - IV :Regularization for Deep Learning Parameter norm Penalties, Norm Penalties as
Constrained Optimization, Regularization and Under-Constrained Problems, Dataset
Augmentation, Noise Robustness, Semi-Supervised learning, Multi-task learning, Early
Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and
other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, tangent Prop and
Manifold, Tangent Classifier
UNIT – V: Optimization for Training Deep Models - Challenges in Neural Network Optimization,
Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning
Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural
Language Processing
1. Connections:-
An ANN consists of a set of highly interconnected processing elements such that each
processing element's output is found to be connected through weights to the other processing
elements or to itself; delay leads and lag-free connections are allowed. Hence, the arrangement
of these processing elements and the geometry of their interconnections are essential for an
ANN. The point where the connection originates and terminates should be noted, and the
function of each processing element in an ANN should be specified.
The arrangement of neurons to form layers and connection pattern formed within and between
layers is called the network architecture.
There are five basic types of neuron connection architectures:
• single-layer feed-forward network
• multilayer feed-forward network
• single node with its own feedback
• single-layer recurrent network
• multilayer recurrent network
A multilayer feed forward network is formed by the interconnection of several layers. The input
layer is that which receives the input and this layer has no function except buffering the input
signal. The output layer generates the output of the network. Any layer that is formed between
the input layer and the output layer is called the hidden layer.
If the feedback of the output of the processing elements is directed back as an input to the
processing elements in the same layer then it is called lateral feedback.
Competitive Net
The competitive interconnections have fixed weight −ε. This net is called Maxnet, and we will
study it in the unsupervised learning network category.
****
➢ Important Terminologies :
The field of artificial neural networks has developed alongside many disciplines, such
as neurobiology, mathematics, statistics, economics, computer science, engineering and
physics, to mention but a few. Consequently, the terminology used in the field varies
from discipline to discipline. Some of the commonly used terms are presented below.
1. Activation Function: Algorithm for computing the activation value of a neurode
as a function of its net input. Net input is typically the sum of weighted inputs to
the neurode.
2. Feed forward Network: Network ordered into layers with no feedback paths. The
lowest layer is the input layer, the highest is the output layer. The outputs of a given
layer go only to higher layers and its inputs come only from lower layers.
3. Supervised Learning: Learning procedure in which a network is presented with a
set of input patterns and target pairs. The network can compare its output to the target
and adapt itself according to the learning rules.
• If the shape of the object is rounded with a depression at the top and its colour is red, then it
will be labelled as Apple.
• If the shape of the object is a long curving cylinder and its colour is green-yellow, then it will
be labelled as Banana.
Now suppose that after training on this data, the machine is given a new, separate fruit from the
basket, say a banana, and is asked to identify it.
Since the machine has already learned from the previous data, it must now use that knowledge
wisely. It will first classify the fruit by its shape and colour, confirm the fruit name as BANANA
and put it in the Banana category. Thus the machine learns from the training data (the basket
containing fruits) and then applies the knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such
as “Red” or “blue” or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
****
➢ Perceptron Network :
Developed by Frank Rosenblatt by using McCulloch and Pitts model, perceptron is the
basic operational unit of artificial neural networks. It employs supervised learning rule and is
able to classify the data into two classes.
Operational characteristics of the perceptron: It consists of a single neuron with an arbitrary
number of inputs along with adjustable weights, but the output of the neuron is 1 or 0
depending upon the threshold. It also consists of a bias whose weight is always 1. Following
figure gives a schematic representation of the perceptron.
Training Algorithm
Perceptron network can be trained for single output unit as well as multiple output units.
Training Algorithm for Single Output Unit
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
xi = si (i = 1 to n)
Step 5 − Now obtain the net input with the following relation −
Yin = b + ∑i xi wi
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output.
f(Yin) = 1 if Yin > θ
         0 if −θ ≤ Yin ≤ θ
         −1 if Yin < −θ
Step 7 − Adjust the weight and bias as follows −
Case 1 − if y ≠ t then,
Wi(new) = Wi(old) + αtxi
b(new)=b(old)+αt
Case 2 − if y = t then,
Wi(new)=Wi(old)
b(new)=b(old)
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Step 8 − Test for the stopping condition, which would happen when there is no change in
weight.
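As an illustration, here is a minimal NumPy sketch of the single-output training rule above; the AND-style data at the bottom is only an example, not part of the original notes.

```python
import numpy as np

def perceptron_train(X, T, alpha=1.0, theta=0.2, max_epochs=100):
    """Single-output perceptron (Steps 1-8 above): weights and bias start at 0,
    and are updated only when the computed output y differs from the target t."""
    w = np.zeros(X.shape[1])                  # Step 1: weights = 0
    b = 0.0                                   # Step 1: bias = 0
    for _ in range(max_epochs):               # Step 2: repeat while weights change
        changed = False
        for x, t in zip(X, T):                # Step 3: every training vector
            y_in = b + np.dot(x, w)           # Step 5: net input
            # Step 6: activation with threshold theta
            y = 1 if y_in > theta else (-1 if y_in < -theta else 0)
            if y != t:                        # Step 7, Case 1: adjust weights and bias
                w = w + alpha * t * x
                b = b + alpha * t
                changed = True
        if not changed:                       # Step 8: stop when nothing changed
            break
    return w, b

# Illustrative data: AND function with bipolar inputs and targets
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([1, -1, -1, -1])
w, b = perceptron_train(X, T)
```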
The following diagram is the architecture of perceptron for multiple output classes.
Training Algorithm for Multiple Output Units
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
xi=si (i=1 to n)
Step 5 − Obtain the net input with the following relation −
Yinj = bj + ∑i xi wij
Here 'bj' is the bias on output unit j and 'n' is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output for each output
unit j = 1 to m −
f(Yinj) = 1 if Yinj > θ
          0 if −θ ≤ Yinj ≤ θ
          −1 if Yinj < −θ
Step 7 − Adjust the weight and bias for i = 1 to n and j = 1 to m as follows −
Case 1 − if yj ≠ tj then, wij(new) = wij(old) + αtjxi and bj(new) = bj(old) + αtj
Case 2 − if yj = tj then, there is no change in weights.
Step 8 − Test for the stopping condition, which would happen when there is no change in weight.
****
➢ Adaline Network :
Adaline, which stands for Adaptive Linear Neuron, is a network having a single linear unit. It
was developed by Widrow and Hoff in 1960. Some important points about Adaline are as
follows −
• It uses bipolar activation function.
• It uses the delta rule for training to minimize the Mean-Squared Error (MSE) between
the actual output and the desired/target output.
• The weights and the bias are adjustable.
Architecture:
The basic structure of Adaline is similar to perceptron having an extra feedback loop with the
help of which the actual output is compared with the desired/target output. After comparison
on the basis of training algorithm, the weights and bias will be updated.
Training Algorithm
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every bipolar training pair s:t.
Step 4 − Activate each input unit as follows −
xi=si (i=1 to n)
Step 5 − Obtain the net input with the following relation −
Yin = b + ∑i xi wi
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output −
f(Yin) = 1 if Yin ≥ 0
         −1 if Yin < 0
Step 7 − Adjust the weight and bias as follows −
Case 1 − if y ≠ t then,
Wi(new) = Wi(old) + α(t−Yin)xi
b(new) = b(old) + α(t−Yin)
Case 2 − if y = t then,
Wi(new)=Wi(old)
b(new)=b(old)
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
(t−Yin) is the computed error.
Step 8 − Test for the stopping condition, which will happen when there is no change in weight
or the highest weight change occurred during training is smaller than the specified tolerance.
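A minimal NumPy sketch of the Adaline delta rule described above; the tolerance-based stopping condition follows Step 8, and all numeric defaults are illustrative.

```python
import numpy as np

def adaline_train(X, T, alpha=0.1, tol=1e-4, max_epochs=100):
    """Adaline delta rule: move the weights by alpha*(t - y_in)*x and stop when the
    largest weight change in an epoch falls below the specified tolerance."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        max_change = 0.0
        for x, t in zip(X, T):
            y_in = b + np.dot(x, w)              # net input (learning uses y_in, not f(y_in))
            delta_w = alpha * (t - y_in) * x     # delta rule update for the weights
            delta_b = alpha * (t - y_in)         # delta rule update for the bias
            w, b = w + delta_w, b + delta_b
            max_change = max(max_change, np.max(np.abs(delta_w)))
        if max_change < tol:                     # Step 8: stopping condition
            break
    return w, b
```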
****
➢ Back Propagation Network :
Back Propagation Network (BPN) is a multilayer neural network consisting of the input
layer, at least one hidden layer and output layer. As its name suggests, back propagating will
take place in this network. The error which is calculated at the output layer, by comparing the
target output and the actual output, will be propagated back towards the input layer.
Architecture
As shown in the diagram, the architecture of BPN has three interconnected layers having
weights on them. The hidden layer as well as the output layer also has bias, whose weight is
always 1, on them. As is clear from the diagram, the working of BPN is in two phases. One
phase sends the signal from the input layer to the output layer, and the other phase back
propagates the error from the output layer to the input layer.
Training Algorithm
For training, BPN will use binary sigmoid activation function. The training of BPN will have
the following three phases.
• Phase 1 − Feed Forward Phase
• Phase 2 − Back Propagation of error
• Phase 3 − Updating of weights
All these steps will be concluded in the algorithm as follows
Step 1 − Initialize the following to start the training −
• Weights
• Learning rate α
For easy calculation and simplicity, take some small random values.
Step 2 − Continue step 3-11 when the stopping condition is not true.
Step 3 − Continue step 4-10 for every training pair.
Phase 1
Step 4 − Each input unit receives input signal xi and sends it to the hidden unit for all i = 1 to
n
Step 5 − Calculate the net input at the hidden unit using the following relation −
Qinj = b0j + ∑i xi vij
Here b0j is the bias on hidden unit j, and vij is the weight on the j-th unit of the hidden layer
coming from the i-th unit of the input layer.
Now calculate the net output by applying the following activation function
Qj=f(Qinj)
Send these output signals of the hidden layer units to the output layer units.
Step 6 − Calculate the net input at the output layer unit using the following relation −
Yink = b0k + ∑j Qj wjk
Here b0k is the bias on output unit k, and wjk is the weight on the k-th unit of the output layer
coming from the j-th unit of the hidden layer.
Calculate the net output by applying the following activation function
Yk=f(Yink)
Phase 2
Step 7 − Compute the error correcting term, in correspondence with the target pattern received
at each output unit, as follows −
δk=(tk−Yk)f′(Yink)
On this basis, update the weight and bias as follows −
Δwjk = α δk Qj
Δb0k = α δk
Then, send δk back to the hidden layer.
Step 8 − Now each hidden unit sums its delta inputs from the output units −
δinj = ∑k δk wjk
The error term is δj = δinj f′(Qinj). On this basis, update the weight and bias as follows −
Δvij = α δj xi
Δb0j = α δj
Phase 3
Step 9 − Each output unit (yk, k = 1 to m) updates the weight and bias as follows −
wjk(new) = wjk(old) + Δwjk
b0k(new) = b0k(old) + Δb0k
Step 10 − Each hidden unit (zj, j = 1 to p) updates the weight and bias as follows −
vij(new) = vij(old) + Δvij
b0j(new) = b0j(old) + Δb0j
Step 11 − Check for the stopping condition, which may be either the number of epochs
reached or the target output matches the actual output.
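A compact NumPy sketch of one training pair passing through the three phases above, assuming a single hidden layer with binary sigmoid units (so f′(y_in) = y(1 − y)); array names mirror the notation of the steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpn_train_step(x, t, V, b_v, W, b_w, alpha=0.5):
    """One BPN training pair. V, b_v: input->hidden weights/bias (v_ij, b0_j);
    W, b_w: hidden->output weights/bias (w_jk, b0_k)."""
    # Phase 1: feed forward
    q_in = b_v + x @ V                      # Step 5: net input at hidden units
    q = sigmoid(q_in)
    y_in = b_w + q @ W                      # Step 6: net input at output units
    y = sigmoid(y_in)
    # Phase 2: back propagation of error
    delta_k = (t - y) * y * (1 - y)         # Step 7: delta_k = (t_k - y_k) f'(y_in_k)
    delta_in_j = delta_k @ W.T              # Step 8: sum of delta inputs
    delta_j = delta_in_j * q * (1 - q)
    # Phase 3: weight updates
    W += alpha * np.outer(q, delta_k)       # Step 9: hidden -> output weights
    b_w += alpha * delta_k
    V += alpha * np.outer(x, delta_j)       # Step 10: input -> hidden weights
    b_v += alpha * delta_j
    return V, b_v, W, b_w
```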
****
➢ Auto Associative Memory :
This is a single layer neural network in which the input training vector and the output target
vectors are the same. The weights are determined so that the network stores a set of patterns.
Architecture
As shown in the following figure, the architecture of Auto Associative memory network
has ‘n’ number of input training vectors and similar ‘n’ number of output target vectors.
Training Algorithm
For training, this network is using the Hebb or Delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows − xi = si (i = 1 to n)
Step 4 − Activate each output unit as follows − yj = sj (j = 1 to n), and adjust the weights as
wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n −
Yinj = ∑i xi wij
Step 5 − Apply the following activation function to calculate the output −
yj = f(Yinj) = +1 if Yinj > 0; −1 if Yinj ≤ 0
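A tiny NumPy sketch of the Hebb (outer product) rule for an auto-associative memory; the bipolar pattern and the noisy probe are illustrative values, and zeroing the diagonal (no self-connections) is a common convention rather than something stated above.

```python
import numpy as np

s = np.array([1, 1, -1, -1])       # pattern to store
W = np.outer(s, s)                 # Hebb rule: w_ij = s_i * s_j
np.fill_diagonal(W, 0)             # no self-connections (common convention)

probe = np.array([1, 1, -1, 1])    # stored pattern with one flipped bit
y_in = probe @ W                   # net input to each output unit
y = np.where(y_in > 0, 1, -1)      # bipolar activation
print(y)                           # -> [ 1  1 -1 -1], the stored pattern is recovered
```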
➢ Hetero Associative Memory :
Similar to the Auto Associative Memory network, this is also a single layer neural network.
However, in this network the input training vector and the output target vectors are not the
same. The weights are determined so that the network stores a set of patterns. Hetero
associative network is static in nature, hence, there would be no non-linear and delay
operations.
Architecture
As shown in the following figure, the architecture of Hetero Associative Memory network
has ‘n’ number of input training vectors and ‘m’ number of output target vectors.
Training Algorithm
For training, this network is using the Hebb or Delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 i=1 to n, j=1 to m
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
xi = si (i = 1 to n)
Step 4 − Activate each output unit as follows − yj = tj (j = 1 to m), and adjust the weights as
wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to m −
Yinj = ∑i xi wij
Step 5 − Apply the following activation function to calculate the output −
yj = +1 if Yinj > 0; 0 if Yinj = 0; −1 if Yinj < 0
1- Hebb Rule for Pattern Association :
This is the sum of the outer product matrices required to store each association
separately. In general, we shall use the preceding formula or the more concise vector-matrix
form,
W = ∑p sᵀ(p) t(p)
Several authors normalize the weights found by the Hebb rule by a factor of 1/n, where
n is the number of units in the system
2- Delta Rule for Pattern Association :
In its original form, the delta rule assumed that the activation function for the
output unit was the identity function. Thus, using y for the computed output for the
input vector x, we have
yj = netj = ∑i xi wij
➢ Bidirectional Associative Memory (BAM) :
Topology
A BAM contains two layers of neurons, which we shall denote X and Y. Layers X and Y are
fully connected to each other. Once the weights have been established, input into layer X
presents the pattern in layer Y, and vice versa
Procedure
Learning
Imagine we wish to store two associations, A1:B1 and A2:B2. The weight matrix is the sum of
the outer (correlation) products of the bipolar association pairs, M = A1ᵀB1 + A2ᵀB2:
M =
[  2   0   0  -2 ]
[  0  -2   2   0 ]
[  2   0   0  -2 ]
[ -2   0   0   2 ]
[  0   2  -2   0 ]
[ -2   0   0   2 ]
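The pair of bipolar associations used in the following NumPy sketch reproduces exactly the correlation matrix M shown above; the recall step at the end is a sketch of how presenting A1 on layer X retrieves B1 on layer Y.

```python
import numpy as np

# Bipolar association pairs (A patterns have 6 components, B patterns have 4)
A1 = np.array([ 1, -1,  1, -1,  1, -1]);  B1 = np.array([ 1,  1, -1, -1])
A2 = np.array([ 1,  1,  1, -1, -1, -1]);  B2 = np.array([ 1, -1,  1, -1])

# BAM learning: sum of the outer products of each association pair
M = np.outer(A1, B1) + np.outer(A2, B2)
print(M)                                   # equals the 6x4 matrix M above

# Recall: input on layer X produces the associated pattern on layer Y
print(np.where(A1 @ M > 0, 1, -1))         # -> [ 1  1 -1 -1] = B1
```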
Hopfield Networks:
Hopfield neural network was invented by Dr. John J. Hopfield in 1982. It consists of a single
layer which contains one or more fully connected recurrent neurons. The Hopfield network is
commonly used for auto-association and optimization tasks.
1.Discrete Hopfield Network:
A Hopfield network which operates in a discrete fashion, i.e., the input and output patterns
are discrete vectors, which can be either binary (0, 1) or bipolar (+1, −1) in nature. The
network has symmetrical weights with no self-connections, i.e., wij = wji and wii = 0.
Architecture
Following are some important points to keep in mind about discrete Hopfield network −
• This model consists of neurons with one inverting and one non-inverting output.
• The output of each neuron should be the input of other neurons but not the input of
self.
• Weight/connection strength is represented by wij.
• Connections can be excitatory as well as inhibitory. It would be excitatory, if the output
of the neuron is same as the input, otherwise inhibitory.
• Weights should be symmetrical, i.e. wij = wji
The output from Y1 going to Y2, Yi and Yn have the weights w12, w1i and w1n respectively.
Similarly, other arcs have the weights on them.
Training Algorithm
During training of discrete Hopfield network, weights will be updated. As we know that we
can have the binary input vectors as well as bipolar input vectors. Hence, in both the cases,
weight updates can be done with the following relation
Case 1 − Binary input patterns
For a set of binary patterns s(p), p = 1 to P
Here, s(p) = (s1(p), s2(p), ..., si(p), ..., sn(p))
Weight Matrix is given by
wij = ∑p=1 to P [2si(p) − 1][2sj(p) − 1] for i ≠ j
Case 2 − Bipolar input patterns
For a set of bipolar patterns s(p), p = 1 to P, the Weight Matrix is given by
wij = ∑p=1 to P si(p) sj(p) for i ≠ j
Testing Algorithm
Step 1 − Initialize the weights, which are obtained from training algorithm by using Hebbian
principle.
Step 2 − Perform steps 3-9, if the activations of the network have not converged.
Step 3 − For each input vector X, perform steps 4-8.
Step 4 − Make initial activation of the network equal to the external input vector X as follows
−
Yi=Xi for i=1 to n
Step 5 − For each unit Yi, perform steps 6-9
Step 6 − Calculate the net input of the network as follows −
Yini = xi + ∑j yj wji
Step 7 − Apply the activation as follows over the net input to calculate the output −
Yi = 1 if Yini >θi
Yi if Yini = θi
0 if Yini < θi
Here θi is the threshold.
Step 8 − Broadcast this output yi to all other units.
Step 9 − Test the network for convergence.
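A minimal NumPy sketch of the discrete Hopfield testing algorithm above for one stored bipolar pattern; the stored pattern, the probe and the fixed number of sweeps are illustrative choices.

```python
import numpy as np

p = np.array([1, -1, 1, -1, 1, -1])        # stored pattern
W = np.outer(p, p)                         # Hebbian weights, w_ij = s_i s_j
np.fill_diagonal(W, 0)                     # w_ii = 0, and W is symmetric

x = np.array([1, -1, 1, -1, 1, 1])         # probe: stored pattern with last bit flipped
y = x.copy()                               # Step 4: initial activations = external input
for _ in range(3):                         # repeat sweeps until no unit changes
    for i in range(len(y)):                # Steps 5-8: update units one at a time
        y_in = x[i] + y @ W[:, i]          # Step 6: net input (external + internal)
        if y_in > 0:
            y[i] = 1                       # Step 7: threshold theta_i = 0
        elif y_in < 0:
            y[i] = -1                      # (activation unchanged when y_in == 0)
print(y)                                   # -> [ 1 -1  1 -1  1 -1], the stored pattern
```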
2.Continuous Hopfield Network
In comparison with Discrete Hopfield network, continuous network has time as a continuous
variable. It is also used in auto association and optimization problems such as travelling
salesman problem.
Model − The model or architecture can be built up by adding electrical components such as
amplifiers which can map the input voltage to the output voltage over a sigmoid activation
function.
THE END
UNIT – II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet,
Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization,
Counter Propagation Networks, Adaptive Resonance Theory Networks. Special Networks
Introduction to various networks.
➢ Introduction
Unsupervised learning is a type of machine learning that looks for previously undetected
patterns in a data set with no pre-existing labels and with a minimum of human supervision. In
contrast to supervised learning that usually makes use of human-labeled data, unsupervised
learning, also known as self-organization allows for modeling of probability densities over
inputs. It forms one of the three main categories of machine learning, along
with supervised and reinforcement learning. Semi-supervised learning, a related variant,
makes use of supervised and unsupervised techniques.
Two of the main methods used in unsupervised learning are principal component analysis
and cluster analysis. Cluster analysis is used in unsupervised learning to group, or
segment, datasets with shared attributes in order to extrapolate algorithmic
relationships. Cluster analysis is a branch of machine learning that groups the data that has not
been labelled, classified or categorized. Instead of responding to feedback, cluster analysis
identifies commonalities in the data and reacts based on the presence or absence of such
commonalities in each new piece of data. This approach helps detect anomalous data points
that do not fit into either group.
1- Maxnet
[Figure: Maxnet architecture — n fully interconnected nodes; each node has a self-excitatory
connection of weight 1 and mutual inhibitory connections of weight −є to every other node i ≠ j.]
Maxnet Algorithm
Step 1: Initialize the activations (aj(0) = input signal to node j) and the weights:
wij = 1 for i = j
wij = −є for i ≠ j
Step 2: If more than one node has nonzero output, do step 3 to 5.
Step 3: Update the activation (output) at each node, for j = 1, 2, 3, ..., n:
aj(t+1) = f [ aj(t) − є ∑i≠j ai(t) ]
Step 4: Save the activations obtained in step 3 for use in the next iteration.
Step 5: Test for the stopping condition: if more than one node has a nonzero output, continue
iterating; otherwise stop.
Example: A Maxnet has three nodes with inhibitory weights є = 0.25. The net is initially
activated by the input signals [0.1 0.3 0.9]. The activation function of the neurons is:
f(net) = net if net > 0
         0 otherwise
Find the final winning neuron.
Solution:
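A minimal NumPy sketch that iterates the Maxnet update from step 3 for this example (є = 0.25, initial activations [0.1, 0.3, 0.9]):

```python
import numpy as np

eps = 0.25
a = np.array([0.1, 0.3, 0.9])                     # initial activations
f = lambda net: np.maximum(net, 0.0)              # f(net) = net if net > 0, else 0

while np.count_nonzero(a) > 1:                    # stop when only one node stays active
    a = f(a - eps * (a.sum() - a))                # a_j - eps * (sum of the other activations)
    print(a)

# Iteration 1: [0.     0.05   0.8   ]
# Iteration 2: [0.     0.     0.7875]
```

After two iterations only the third node remains active, so neuron 3 is the winner.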
2. Hamming Net:
The Hamming net is a fixed-weight net that selects the stored exemplar vector closest (in
Hamming distance) to a bipolar input vector; a Maxnet is used as a subnet to pick the winning
class.
Structure
[Figure: Hamming net structure — input units x1 ... x4 are fully connected to the net-input
units y1 (Class 1) and y2 (Class 2), whose outputs feed a Maxnet that selects the winning class.]
Wij = ei(j)/2
Where ei(j) is the i'th component of the j'th exemplar vector.
Terminology
M: number of exemplar vectors
N: number of input nodes (input vector components)
E(j): j'th exemplar vector
Algorithm:
Step 1: Initialize the weights wij = ei(j)/2 and the biases bj = n/2.
Step 2: For each input vector x, compute the net input to each class unit: Yinj = bj + ∑i xi wij.
Step 3: Maxnet iterations are used to find the best match exemplar.
Example: Given the exemplar vector e(1)=(-1 1 1 -1) and
e(2)=(1 -1 1 -1). Use Hamming net to find the exemplar vector close to bipolar input
patterns
(1 1 -1 -1), (1 -1 -1 -1), (-1 -1 -1 1) and (-1 -1 1 1).
[Figure: Hamming net for this example — four input units connected to class units y1 and y2
(with biases b1 and b2), followed by a Maxnet.]
Solution:
The weights are wij = ei(j)/2, so since e(1) = (−1 1 1 −1) and e(2) = (1 −1 1 −1),
W =
 [ -0.5   0.5 ]
 [  0.5  -0.5 ]
 [  0.5   0.5 ]
 [ -0.5  -0.5 ]
and the biases are bj = n/2 = 2.
Step 2: Apply the 1st bipolar input (1 1 −1 −1):
Yin1 = b1 + ∑ xi wi1 = 2 + (1 1 −1 −1) · (−0.5 0.5 0.5 −0.5) = 2
Yin2 = b2 + ∑ xi wi2 = 2 + (1 1 −1 −1) · (0.5 −0.5 0.5 −0.5) = 2
Hence, the first input pattern has the same Hamming distance (HD = 2) to both exemplar
vectors.
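A small NumPy sketch of step 2 applied to all four test patterns of the example, using the weights and biases computed above; the larger of the two net inputs indicates the closer exemplar class.

```python
import numpy as np

e1 = np.array([-1,  1,  1, -1])
e2 = np.array([ 1, -1,  1, -1])
W = np.column_stack([e1, e2]) / 2.0        # 4x2 weight matrix, w_ij = e_i(j)/2
b = len(e1) / 2.0                          # bias b_j = n/2 = 2

for x in [( 1,  1, -1, -1), ( 1, -1, -1, -1), (-1, -1, -1,  1), (-1, -1,  1,  1)]:
    y_in = b + np.array(x) @ W             # similarity score n - HD for each exemplar
    print(x, "->", y_in)
```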
3. Kohonen Self-Organizing Feature Maps (KSOM):
There can be various topologies, however the following two topologies are used the most −
Rectangular Grid Topology
This topology has 24 nodes in the distance-2 grid, 16 nodes in the distance-1 grid, and 8 nodes in
the distance-0 grid, which means the difference between each rectangular grid is 8 nodes. The
winning unit is indicated by #.
Hexagonal Grid Topology
This topology has 18 nodes in the distance-2 grid, 12 nodes in the distance-1 grid, and 6 nodes in
the distance-0 grid, which means the difference between each rectangular grid is 6 nodes. The
winning unit is indicated by #.
Architecture
The architecture of KSOM is similar to that of the competitive network. With the help of
neighborhood schemes, discussed earlier, the training can take place over the extended region of
the network.
Training Algorithm
Step 1 − Initialize the weights, the learning rate α and the neighborhood topological scheme.
Step 2 − Continue step 3-9, when the stopping condition is not true.
Step 3 − Continue step 4-6 for every input vector x.
Step 4 − Calculate Square of Euclidean Distance for j = 1 to m
D(j) = ∑i=1 to n (xi − wij)²
Step 5 − Obtain the winning unit J where D(j) is minimum.
Step 6 − Calculate the new weight of the winning unit by the following relation −
wij(new) = wij(old) + α[xi − wij(old)]
Step 7 − Update the learning rate, e.g. α(t+1) = 0.5 α(t).
Step 8 − Reduce the radius of the topological neighborhood scheme.
Step 9 − Check for the stopping condition of the network.
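A minimal NumPy sketch of one KSOM training pass (Steps 4-6 above), simplified to update only the winning unit, i.e. without a neighborhood; the random data and the two cluster units are illustrative.

```python
import numpy as np

def ksom_step(W, x, alpha):
    """W has one column per cluster unit; x is one input vector."""
    d = ((x[:, None] - W) ** 2).sum(axis=0)      # Step 4: squared Euclidean distance D(j)
    J = np.argmin(d)                             # Step 5: winning unit
    W[:, J] += alpha * (x - W[:, J])             # Step 6: move the winner towards the input
    return J

rng = np.random.default_rng(0)
W = rng.random((3, 2))                           # 3 input features, 2 cluster units
for x in rng.random((20, 3)):
    ksom_step(W, x, alpha=0.5)
```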
4. Learning Vector Quantization (LVQ):
Following figure shows the architecture of LVQ which is quite similar to the architecture of
KSOM. As we can see, there are “n” number of input units and “m” number of output units. The
layers are fully interconnected with weights on them.
Parameters Used:
Following are the parameters used in LVQ training process as well as in the flowchart
• x = training vector (x1,...,xi,...,xn)
• T = class for training vector x
• wj = weight vector for jth output unit
• Cj = class associated with the jth output unit
Training Algorithm:
Step 7 − Update the weights of the winning unit as follows −
if T = Cj then wj(new) = wj(old) + α[x − wj(old)]
if T ≠ Cj then wj(new) = wj(old) − α[x − wj(old)]
Step 8 − Reduce the learning rate α.
Step 9 − Test for the stopping condition. It may be a specified number of epochs reached or the
learning rate reduced to a negligible value.
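A minimal NumPy sketch of the LVQ update above: the winning prototype is pulled towards x when the class matches and pushed away otherwise; the prototypes, labels and sample point are illustrative.

```python
import numpy as np

def lvq_step(W, C, x, t, alpha):
    """W: one prototype (weight vector) per row; C: class of each prototype."""
    J = np.argmin(((W - x) ** 2).sum(axis=1))    # winning output unit (closest prototype)
    if C[J] == t:
        W[J] += alpha * (x - W[J])               # T == C_J : move towards x
    else:
        W[J] -= alpha * (x - W[J])               # T != C_J : move away from x
    return J

W = np.array([[0.0, 0.0], [1.0, 1.0]])           # one prototype per class
C = np.array([0, 1])                             # class associated with each prototype
lvq_step(W, C, x=np.array([0.2, 0.1]), t=0, alpha=0.1)
```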
5. Counter Propagation Networks:
CPN (Counter Propagation Network) was proposed by Hecht-Nielsen in 1987. It is a
multilayer network based on a combination of input, output, and clustering layers. The
applications of counter propagation nets are data compression, function approximation and
pattern association. The counter propagation network is basically constructed from an instar-
outstar model. This model is a three-layer neural network that performs input-output data
mapping, producing an output vector y in response to an input vector x, on the basis of
competitive learning. The three layers in an instar-outstar model are the input layer, the
hidden (competitive) layer and the output layer.
There are two stages involved in the training process of a counter propagation net. The
input vectors are clustered in the first stage. In the second stage of training, the weights from the
cluster layer units to the output units are tuned to obtain the desired response. There are two
types of counter propagation net:
1. Full counter propagation network
2. Forward-only counter propagation network
Training Algorithm for Full CPN:
Step 14: Reduce the learning rates a and b.
a(t+1)=0.5a(t); b(t+1)=0.5b(t)
Step 15: Test stopping condition for phase II training.
Training Algorithm for Forward-only CPN:
Step 0: Initialize the weights and learning rates.
Step 1: Perform step 2 to 7 when stopping condition for phase I training is false.
Step 2: Perform step 3 to 5 for each of training input X.
Step 3: Set the X-input layer activation to vector X.
Step 4: Compute the winning cluster unit J. If dot product method is used, find the cluster unit zJ
with the largest net input:
zinj=∑xi.vij
If Euclidean distance is used, find the cluster unit zJ square of whose distance from the input
pattern is smallest:
Dj=∑(xi-vij)^2
If there exists a tie in the selection of winner unit, the unit with the smallest index is chosen as the
winner.
Step 5: Perform weight updation for unit zJ. For i=1 to n,
viJ(new)=viJ(old) + α[xi-viJ(old)]
Step 6: Reduce learning rate α:
α (t+1)=0.5α(t)
Step 7: Test the stopping condition for phase I training.
Step 8: Perform step 9 to 15 when stopping condition for phase II training is false.
Step 9: Perform step 10 to 13 for each training input pair x:y.
Step 10: Set X-input layer activations to vector X. Set Y-output layer activation to vector Y.
Step 11: Find the winning cluster unit J.
Step 12: Update the weights into unit zJ. For i=1 to n,
viJ(new)=viJ(old) + α[xi-viJ(old)]
Step 13: Update the weights from unit zJ to the output units.
For k=1 to m, wJk(new)=wJk(old) + β[yk-wJk(old)]
Step 14: Reduce learning rate β,
β(t+1)=0.5β(t)
Step 15: Test the stopping condition for phase II training.
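A compact NumPy sketch of the forward-only CPN training procedure above (phase I clusters the inputs with the Kohonen layer; phase II learns the cluster-to-output weights by outstar learning); the Euclidean-distance winner rule is used and all numeric defaults are illustrative.

```python
import numpy as np

def cpn_train(X, Y, n_clusters, alpha=0.5, beta=0.5, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.random((X.shape[1], n_clusters))      # input -> cluster weights (v_iJ)
    W = rng.random((n_clusters, Y.shape[1]))      # cluster -> output weights (w_Jk)

    def winner(x):                                # Steps 4 / 11: Euclidean winner
        return np.argmin(((x[:, None] - V) ** 2).sum(axis=0))

    for _ in range(epochs):                       # Phase I: Steps 1-7
        for x in X:
            J = winner(x)
            V[:, J] += alpha * (x - V[:, J])      # Step 5
        alpha *= 0.5                              # Step 6
    for _ in range(epochs):                       # Phase II: Steps 8-15
        for x, y in zip(X, Y):
            J = winner(x)
            V[:, J] += alpha * (x - V[:, J])      # Step 12
            W[J] += beta * (y - W[J])             # Step 13
        beta *= 0.5                               # Step 14
    return V, W
```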
6. Adaptive Resonance Theory (ART) Networks:
Types of Adaptive Resonance Theory (ART)
Carpenter and Grossberg developed different ART architectures as a result of 20 years of research.
The ARTs can be classified as follows:
• ART1 – It is the simplest and the basic ART architecture. It is capable of clustering binary
input values.
• ART2 – It is an extension of ART1 that is capable of clustering continuous-valued input data.
• Fuzzy ART – It is the augmentation of fuzzy logic and ART.
• ARTMAP – It is a supervised form of ART learning where one ART learns based on the
previous ART module. It is also known as predictive ART.
• FARTMAP – This is a supervised ART architecture with Fuzzy logic included.
The adaptive resonance theory is a type of neural network that is self-organizing and
competitive. It can be of both types, the unsupervised ones (ART1, ART2, ART3, etc.) or the
supervised ones (ARTMAP). Generally, the supervised algorithms are named with the suffix
“MAP”.
But the basic ART model is unsupervised in nature and consists of :
• The F1 layer accepts the inputs and performs some processing and transfers it to the F2
layer that best matches with the classification factor.
There exist two sets of weighted interconnections for controlling the degree of similarity
between the units in the F1 and the F2 layers.
• The F2 layer is a competitive layer. The cluster unit with the largest net input becomes the
candidate to learn the input pattern first, and the rest of the F2 units are ignored.
• The reset unit decides whether or not the cluster unit is allowed to learn the input pattern,
depending on how similar its top-down weight vector is to the input vector; this decision is
made against a vigilance parameter and is called the vigilance test.
Thus we can say that the vigilance parameter helps to incorporate new memories or new
information. Higher vigilance produces more detailed memories, lower vigilance
produces more general memories.
Generally, two types of learning exist: slow learning and fast learning. In fast learning, the
weight update during resonance occurs rapidly. It is used in ART1. In slow learning, the weight
change occurs slowly relative to the duration of the learning trial. It is used in ART2.
• It exhibits stability and is not disturbed by a wide variety of inputs provided to its network.
• It can be integrated and used with various other techniques to give better results.
• It can be used for various fields such as mobile robot control, face recognition, land cover
classification, target recognition, medical diagnosis, signature verification, clustering web
users, etc.
• It has advantages over competitive learning networks (like BPNN, etc.). Competitive learning
lacks the capability to add new clusters when deemed necessary.
Some ART networks are inconsistent (like Fuzzy ART and ART1) as they depend upon
the order in which the training data are presented, or upon the learning rate.
****
UNIT – III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed -
forward networks, Gradient-Based learning, Hidden Units, Architecture Design, Back-
Propagation and Other Differentiation Algorithms
Deep learning allows the computer to build complex concepts out of simpler
concepts.
Below figure shows how a deep learning system can represent the concept of an image of
a person by combining simpler concepts, such as corners and contours, which are in turn defined
in terms of edges. The quintessential example of a deep learning model is the feedforward deep
network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function
mapping some set of input values to output values.
There are two main ways of measuring the depth of a model. The first view is based
on the number of sequential instructions that must be executed to evaluate the architecture. Above
figure illustrates how this choice of language can give two different measurements for the same
architecture. Another approach, used by deep probabilistic models, regards the depth of a model
as being not the depth of the computational graph but the depth of the graph describing how
concepts are related to each other.
• Deep learning has had a long and rich history, but has gone by many names reflecting
different philosophical viewpoints, and has waxed and waned in popularity.
• Deep learning has become more useful as the amount of available training data has
increased.
• Deep learning models have grown in size over time as computer infrastructure (both
hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy
over time.
Broadly speaking, there have been three waves of development of deep learning:
deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism
in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006.
Fig: This figure shows two of the three historical waves of artificial neural nets research,
as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural
networks” according to Google Books.
One may wonder why deep learning has only recently become recognized as a
crucial technology though the first experiments with artificial neural networks were conducted in
the 1950s. As our computers are increasingly networked together, it becomes easier to centralize
these records and curate them into a dataset appropriate for machine learning applications. As of
2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve
acceptable performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10 million labeled
examples. Working successfully with datasets smaller than this is an important research area,
focusing in particular on how we can take advantage of large quantities of unlabeled examples,
with unsupervised or semi-supervised learning.
Another key reason that neural networks are wildly successful today after enjoying
comparatively little success since the 1980s is that we have the computational resources to run
much larger models today. The increase in model size over time, due to the availability of faster
CPUs, the advent of general purpose GPUs, faster network connectivity and better software
infrastructure for distributed computing, is one of the most important trends in the history of deep
learning. This trend is generally expected to continue well into the future.
Deep Feedforward Networks
These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations used to define f, and
finally to the output y. There are no feedback connections in which outputs of the model are fed
back into itself.
Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is associated
with a directed acyclic graph describing how the functions are composed together. For
example, we might have three functions f (1), f (2), and f (3) connected in a chain, to form f(x) =
f(3)(f (2)(f(1) (x ))). These chain structures are the most commonly used structures of neural
networks. In this case, f (1) is called the first layer of the network, f (2) is called the second layer,
and so on. The overall length of the chain gives the depth of the model. It is from this terminology
that the name “deep learning” arises. The final layer of a feedforward network is called the output
layer. The learning algorithm must decide how to use these layers to best implement an
approximation of f∗. Because the training data does not show the desired output for each of these
layers, these layers are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by
neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of
these hidden layers determines the width of the model.
Feedforward networks have introduced the concept of a hidden layer, and this
requires us to choose the activation functions that will be used to compute the hidden layer values.
We must also design the architecture of the network, including how many layers the network
should contain, how these layers should be connected to each other, and how many units should
be in each layer. Learning in deep neural networks requires computing the gradients of complicated
functions. We present the back-propagation algorithm and its modern generalizations, which can
be used to efficiently compute these gradients.
Fig: An example network with a single hidden layer containing two units. (Left) In this style, we draw every
unit as a node in the graph. This style is very explicit and unambiguous but for networks larger
than this example it can consume too much space. (Right)In this style, we draw a node in the graph
for each entire vector representing a layer’s activations. This style is much more compact.
Sometimes we annotate the edges in this graph with the name of the parameters that describe the
relationship between two layers. Here, we indicate that a matrix W describes the mapping from x
to h, and a vector w describes the mapping from h to y.
Gradient-Based Learning
Designing and training a neural network is not much different from training any
other machine learning model with gradient descent. Computing the gradient is slightly more
complicated for a neural network, but can still be done efficiently and exactly.
As with other machine learning models, to apply gradient-based learning we must
choose a cost function, and we must choose how to represent the output of the model.
Cost Functions
An important aspect of the design of a deep neural network is the choice of the cost
function. Fortunately, the cost functions for neural networks are more or less the same as those for
other parametric models, such as linear models.
In most cases, our parametric model defines a distribution p(y | x;θ ) and we simply
use the principle of maximum likelihood. This means we use the cross-entropy between the
training data and the model’s predictions as the cost function.
The total cost function used to train a neural network will often combine one of the
primary cost functions described here with a regularization term.
➢ Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means
that the cost function is simply the negative log-likelihood, equivalently described as the cross-
entropy between the training data and the model distribution. This cost function is given by
J(θ) = −E x,y∼p̂data [log pmodel(y | x)]
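As an illustration, a minimal NumPy sketch of this negative log-likelihood for a softmax output layer; the logits and labels are made-up values used only to show the computation.

```python
import numpy as np

def nll(logits, targets):
    """Average negative log-likelihood (cross-entropy) of integer class labels.
    logits: (batch, k) unnormalized scores; targets: (batch,) class indices."""
    z = logits - logits.max(axis=1, keepdims=True)              # stabilize exponentials
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
print(nll(logits, np.array([0, 2])))                            # mean of -log p_model(y | x)
```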
Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most
of the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
Any kind of neural network unit that may be used as an output can also be used as
a hidden unit. We suppose that the feedforward network provides a set of hidden features defined
by h = f (x ;θ ). The role of the output layer is then to provide some additional transformation from
the features to complete the task that the network must perform.
➢ Linear Units for Gaussian Output Distributions
One simple kind of output unit is an output unit based on an affine transformation
with no nonlinearity. These are often just called linear units.
Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
Linear output layers are often used to produce the mean of a conditional Gaussian distribution:
p(y | x) = N(y; ŷ, I).
Hidden Units
The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles. Rectified linear units are an excellent default choice
of hidden unit. The design process consists of trial and error, intuiting that a kind of hidden unit
may work well, and then training a network with that kind of hidden unit and evaluating its
performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all
input points. For example, the rectified linear function g(z) = max{0,z} is not differentiable at z =
0. This may seem like it invalidates g for use with a gradient based learning algorithm.
Unless indicated otherwise, most hidden units can be described as accepting a
vector of inputs x, computing an affine transformation z = Wᵀx + b, and then applying an element-
wise nonlinear function g(z).
Most hidden units are distinguished from each other only by the choice of the form
of the activation function .
One possibility is to not have an activation g(z) at all. One can also think of this as using
the identity function as the activation function. We have already seen that a linear unit can be
useful as the output of a neural network. It may also be used as a hidden unit.
Softmax units are another kind of unit that is usually used as an output but may sometimes
be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete
variable with k possible values, so they may be used as a kind of switch.
A few other reasonably common hidden unit types include:
• Radial basis function or RBF unit: hi = exp(−(1/σi²) ||W:,i − x||²). This function becomes
more active as x approaches a template W:,i. Because it saturates to 0 for most x, it can be
difficult to optimize.
• Softplus: g(a) = ζ(a) = log(1 + eᵃ). This is a smooth version of the rectifier, used for function
approximation and for the conditional distributions of undirected probabilistic models.
• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is
bounded, g(a) = max(−1 , min(1,a)).
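A minimal NumPy sketch of the hidden-unit nonlinearities listed above, applied elementwise to a pre-activation value (the sample inputs are illustrative):

```python
import numpy as np

def relu(z):       return np.maximum(0.0, z)        # rectifier: g(z) = max{0, z}
def softplus(z):   return np.log1p(np.exp(z))        # zeta(z) = log(1 + e^z)
def hard_tanh(z):  return np.clip(z, -1.0, 1.0)      # g(z) = max(-1, min(1, z))

z = np.linspace(-3, 3, 7)
print(relu(z))
print(softplus(z))
print(hard_tanh(z))
```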
Architecture Design
The word architecture refers to the overall structure of the network: how many units
it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most neural network
architectures arrange these layers in a chain structure, with each layer being a function of the layer
that preceded it. In this structure, the first layer is given by
h(1)= g(1)(W(1)Tx + b(1))
the second layer is given by
h(2)= g(2)(W(2)T h(1) + b(2))
and so on.
In these chain-based architectures, the main architectural considerations are to
choose the depth of the network and the width of each layer. The ideal network architecture for a
task must be found via experimentation guided by monitoring the validation set error.
Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can by
definition represent only linear functions. It has the advantage of being easy to train because many
loss functions result in convex optimization problems when applied to linear models.
The universal approximation theorem states that a feedforward network with a
linear output layer and at least one hidden layer with any “squashing” activation function (such as
the logistic sigmoid activation function) can approximate any Borel measurable function from one
finite-dimensional space to another with any desired non-zero amount of error, provided that the
network is given enough hidden units.
The universal approximation theorem means that regardless of what function we
are trying to learn, we know that a large MLP will be able to represent this function.
In summary, a feedforward network with a single layer is sufficient to represent any
function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In
many circumstances, using deeper models can reduce the number of units required to represent the
desired function and can reduce the amount of generalization error.
i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the
number of linear regions is
O(k^((l−1)+d))
Figure: Empirical results showing that deeper networks generalize better when used
to transcribe multi-digit numbers from photographs of addresses.
Back-Propagation and Other Differentiation Algorithms
When a feedforward network accepts an input x and computes an output ŷ, information flows
forward through the network; during training this forward propagation continues onward until
it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often
simply called backprop, allows the information from the cost to then flow backwards through
the network, in order to compute the gradient.
The term back-propagation is often misunderstood as meaning the whole learning
algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method
for computing the gradient, while another algorithm, such as stochastic gradient descent, is used
to perform learning using this gradient.
Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a
more precise computational graph language. Many ways of formalizing computation as graphs
are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar,
vector, matrix, tensor, or even a variable of another type. To formalize our graphs, we also need
to introduce the idea of an operation. An operation is a simple function of one or more variables.
Chain Rule of Calculus
Back-propagation is an algorithm that computes the chain rule, with a specific order
of operations that is highly efficient. Let x be a real number, and let f and g both be functions
mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
the chain rule states that
dz/dx = (dz/dy)(dy/dx).
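A quick numeric illustration of this chain rule, assuming the illustrative functions g(x) = x² and f(y) = sin(y):

```python
import numpy as np

x = 1.5
y = x ** 2                         # y = g(x)
dz_dy = np.cos(y)                  # f'(y)
dy_dx = 2 * x                      # g'(x)
analytic = dz_dy * dy_dx           # chain rule: dz/dx = (dz/dy)(dy/dx)

eps = 1e-6                         # finite-difference check of the same derivative
numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)
print(analytic, numeric)           # the two values agree to high precision
```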
Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that produced that
scalar.
Specifically, many subexpressions may be repeated several times within the overall
expression for the gradient. Any procedure that computes the gradient will need to choose whether
to store these subexpressions or to recompute them several times. An example of how these
repeated subexpressions arise is given in figure .
Figure 6.9: A computational graph that results in repeated subexpressions when computing
the gradient.
Symbol-to-Symbol Derivatives
Algebraic expressions and computational graphs both operate on symbols, or
variables that do not have specific values. These algebraic and graph-based representations are
called symbolic representations. When we actually use or train a neural network, we must assign
specific values to these symbols. We replace a symbolic input to the network x with a specific
numeric value, such as [1.2,3.765,−1.8]T.
Another approach is to take a computational graph and add additional nodes to the
graph that provide a symbolic description of the desired derivatives.
General Back-Propagation
More formally, each node in the graph G corresponds to a variable. To achieve
maximum generality, we describe this variable as being a tensor V. Tensor can in general have
any number of dimensions. They subsume scalars, vectors, and matrices.
The back-propagation algorithm itself does not need to know any differentiation
rules. It only needs to call each operation’s bprop rules with the right arguments. Formally,
op.bprop(inputs, X, G) must return
∑i (∇X op.f(inputs)i) Gi
Here, inputs is a list of inputs that are supplied to the operation, op.f is the
mathematical function that the operation implements, X is the input whose gradient we
wish to compute, and G is the gradient on the output of the operation.
Complications
Most software implementations need to support operations that can return more
than one tensor. For example, if we wish to compute both the maximum value in a tensor and the
index of that value, it is best to compute both in a single pass through memory, so it is most efficient
to implement this procedure as a single operation with two outputs.
The deep learning community has been somewhat isolated from the broader
computer science community and has largely developed its own cultural attitudes concerning how
to perform differentiation. More generally, the field of automatic differentiation is concerned
with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic
differentiation. It is a special case of a broader class of techniques called reverse mode
accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders.
In general, determining the order of evaluation that results in the lowest computational cost is a
difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-
complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into
their least expensive form.
UNIT - IV : Regularization for Deep Learning Parameter norm Penalties, Norm Penalties as
Constrained Optimization, Regularization and Under-Constrained Problems, Dataset
Augmentation, Noise Robustness, Semi-Supervised learning, Multi-task learning, Early Stopping,
Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and other Ensemble
Methods, Dropout, Adversarial Training, Tangent Distance, tangent Prop and Manifold, Tangent
Classifier
In Machine Learning, and more so in Deep Learning, overfitting is a major issue that occurs
during training. A model is considered as overfitting the training data when the training error
keeps decreasing but the test error (or the generalisation error) starts increasing. At this point we
tend to believe that the model is learning the training data distribution and not generalising to
unseen data. Regularization is a modification we make to the learning algorithm or the model
architecture that reduces its generalisation error, possibly at the expense of increased training
error. There are various ways of doing this, some of which include restriction on parameter
values or adding terms to the objective function, etc.
These constraints are designed to encode some sort of prior knowledge, with a preference
towards simpler models to promote generalisation (see Occam’s Razor). The sections present in
this chapter are listed below:
1. Parameter Norm Penalties
2. Norm Penalties as Constrained Optimization
3. Regularization and Under-Constrained Problems
4. Dataset Augmentation
5. Noise Robustness
6. Semi-Supervised Learning
7. Mutlitask Learning
8. Early Stopping
9. Parameter Tying and Parameter Sharing
10. Sparse Representations
11. Bagging and Other Ensemble Methods
12. Dropout
13. Adversarial Training
14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier
1. Parameter Norm Penalties
The idea here is to limit the capacity (the space of all possible model families) of the model by
adding a parameter norm penalty, Ω(θ), to the objective function, J:
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
Here, θ represents only the weights and not the biases, the reason being that the biases require
much less data to fit and do not add much variance.
1.1 L² parameter regularization
Applying the 2nd order Taylor-Series approximation (ignoring all terms of order greater than 2 in
the Taylor-Series expansion) at the point w* (where J(θ; X, y) assumes its minimum value, i.e.,
∇J(w*) = 0), we get the following expression (as the first order gradient term is 0):
Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
Finally, ∇Ĵ(w) = H(w − w*), since the first term is just a constant and the derivative of the
quadratic form ½(w − w*)ᵀH(w − w*) is H(w − w*). The overall gradient of the regularized
objective (gradient of Ĵ plus the gradient of αΩ(θ)) becomes:
αw + H(w − w*)
Setting this to zero at the regularized minimum w̃ gives w̃ = (H + αI)⁻¹ H w*.
As α approaches 0, w̃ comes closer to w*. Finally, since H is real and symmetric, it can be
decomposed into a diagonal matrix Λ and an orthonormal basis of eigenvectors, Q. That is,
H = QΛQᵀ.
To look at its application to Machine Learning, we have to look at linear regression. The
objective function there is exactly quadratic, given by:
(Xw − y)ᵀ(Xw − y)
With L² regularization this becomes (Xw − y)ᵀ(Xw − y) + ½αwᵀw, and the solution changes
from w = (XᵀX)⁻¹Xᵀy to w = (XᵀX + αI)⁻¹Xᵀy.
1.2 L¹ parameter regularization
The L¹ penalty is Ω(θ) = ||w||₁ = ∑i |wi|, so the gradient of the regularized objective contains an
extra term α sign(w).
Now, the last term, sign(w), creates some difficulty as the gradient no longer scales linearly with
w. This leads to a few complexities in arriving at the optimal solution.
My current interpretation of the max term is that there shouldn’t be a zero crossing, as the
absolute value function is not differentiable at zero.
Thus, L¹ regularization has the property of sparsity, which is its fundamental distinguishing
feature from L². Hence, L¹ is used for feature selection as in LASSO.
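A small NumPy sketch contrasting the gradients of the two penalties discussed above, as they would be added to the gradient of the unregularized objective J (biases left unpenalized); the weight values are illustrative.

```python
import numpy as np

def l2_penalty_grad(w, alpha):
    return alpha * w                   # weight decay: Omega = (alpha/2) * ||w||^2

def l1_penalty_grad(w, alpha):
    return alpha * np.sign(w)          # constant-magnitude pull towards 0 -> sparsity

w = np.array([0.8, -0.05, 0.0, 2.0])
print(l2_penalty_grad(w, 0.1))         # shrinks each weight in proportion to its value
print(l1_penalty_grad(w, 0.1))         # same-size push regardless of the weight's size
```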
2. Norm Penalties as Constrained Optimization
From chapter 4’s section 4, we know that to minimize any function under some constraints, we
can construct a generalized Lagrangian function containing the objective function along with the
penalties. Suppose we wanted Ω(θ) < k, then we could construct the following Lagrangian:
We get optimal θ by solving the Lagrangian. If Ω(θ) > k, then the weights need to be
compensated highly and hence, α should be large to reduce its value below k. Likewise, if
Ω(θ)<k, then the norm shouldn’t be reduced too much and hence, α should be small. This is now
similar to the parameter norm penalty regularized objective function as both of them encourage
lower values of the norm. Thus, parameter norm penalties naturally impose a constraint, like the
L²-regularization, defining a constrained L²-ball. Larger α implies a smaller constrained region as
it pushes the values really low, hence, allowing a small radius and vice versa. The idea of
constraints over penalties is important for several reasons. Large penalties might cause non-
convex optimization algorithms to get stuck in local minima due to small values of θ, leading to
the formation of so-called dead cells, as the weights entering and leaving them are too small to
have an impact. Constraints don’t enforce the weights to be near zero, rather being confined to a
constrained region.
Another reason is that constraints induce higher stability. With higher learning rates, there might
be a large weight, leading to a large gradient, which could go on iteratively leading to numerical
overflow in the value of θ. Constraints, along with reprojection (to the corresponding ball),
prevent the weights from becoming too large, thus maintaining stability.
A final suggestion made by Hinton was to restrict the individual column norms of the weight
matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden
unit from having a large weight. The idea here is that if we restrict the Frobenius norm, it doesn’t
guarantee that the individual weights would be small, just their norm. So, we might have large
weights being compensated by extremely small weights to make the overall norm small.
Restricting each hidden unit individually gives us the required guarantee.
3. Regularization and Under-Constrained Problems
Underdetermined problems are those problems that have infinitely many solutions. A logistic
regression problem having linearly separable classes with w as a solution, will always have 2w
as a solution and so on. In some machine learning problems, regularization is necessary. For e.g.,
many algorithms (e.g. PCA) require the inversion of X’ X, which might be singular. In such a
case, we can use a regularized form instead. (X’ X + αI) is guaranteed to be invertible.
Regularization can solve underdetermined problems. For e.g. the Moore-Penrose pseudoinverse
defined earlier as
X⁺ = lim(α→0) (XᵀX + αI)⁻¹Xᵀ
can be seen as performing linear regression with weight decay.
4. Data augmentation
Having more data is the most desirable thing for improving a machine learning model’s
performance. In many cases, it is relatively easy to artificially generate data. For a classification
task, we desire for the model to be invariant to certain types of transformations, and we can
generate the corresponding (x,y)pairs by translating the input x. But for certain problems, like
density estimation, we can’t apply this directly unless we have already solved the density
estimation problem.
However, caution needs to be maintained while augmenting data to make sure that the class
doesn’t change. For e.g., if the labels contain both “b” and “d”, then horizontal flipping would be
a bad idea for data augmentation. Adding random noise to the inputs is another form of data
augmentation, while adding noise to hidden units can be seen as doing data augmentation at
multiple levels of abstraction.
Finally, when comparing machine learning models, we need to evaluate them using the same
hand-designed data augmentation schemes or else it might happen that algorithm A outperforms
algorithm B, just because it was trained on a dataset which had more / better data augmentation.
5. Noise Robustness
Noise with infinitesimal variance imposes a penalty on the norm of the weights. Noise added to
hidden units is very important and is discussed later in Dropout. Noise can even be added to the
weights. This has several interpretations. One of them is that adding noise to weights is a
stochastic implementation of Bayesian inference over the weights, where the weights are
considered to be uncertain, with the uncertainty being modelled by a probability distribution. It is
also interpreted as a more traditional form of regularization by ensuring stability in learning.
For e.g. in the linear regression case, we want to learn the mapping y(x) for each feature vector x,
by reducing the mean square error.
Now, suppose a zero mean unit variance Gaussian random noise, ϵ, is added to the weights. We
still want to learn the appropriate mapping through reducing the mean square. Minimizing the loss
after adding noise to the weights is equivalent to adding another regularization term which makes
sure that small perturbations in the weight values don’t affect the predictions much, thus
stabilising training.
Sometimes we may have the wrong output labels, in which case maximizing p(y | x)may not be a
good idea. In such a case, we can add noise to the labels by assigning a probability of (1-ϵ) that
the label is correct and a probability of ϵ that it is not. In the latter case, all the other labels are
equally likely. Label Smoothing regularizes a model with k softmax outputs by assigning the
classification targets with probability (1-ϵ ) or choosing any of the remaining (k-1) classes with
probability ϵ / (k-1).
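A minimal NumPy sketch of the label smoothing scheme described above; the labels, number of classes and ϵ are illustrative values.

```python
import numpy as np

def smooth_labels(targets, k, eps=0.1):
    """Correct class gets probability 1 - eps; the other k - 1 classes share eps equally."""
    y = np.full((len(targets), k), eps / (k - 1))
    y[np.arange(len(targets)), targets] = 1.0 - eps
    return y

print(smooth_labels(np.array([0, 2]), k=3, eps=0.1))
# [[0.9  0.05 0.05]
#  [0.05 0.05 0.9 ]]
```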
6. Semi-Supervised Learning
P(x,y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, I have a
label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any
labels. In Semi-supervised Learning, we use both P(x,y)(some labelled samples) and
P(x)(unlabelled samples) to estimate P(y|x)(since we want to predict the class, given the training
sample). We want to learn some representation h = f(x)such that samples which are closer in the
input space have similar representations and a linear classifier in the new space achieves better
generalization error.
Instead of separating the supervised and unsupervised criteria, we can instead have a generative
model of P(x) (or P(x, y)) which shares parameters with the discriminative model. The idea is to
share the unsupervised/generative criterion with the supervised criterion to express a prior belief
that the structure of P(x) (or P(x, y)) is connected to the structure of P(y|x), which is expressed
by the shared parameters.
7. Multitask Learning
The idea is to improve the generalization error by pooling together examples from multiple tasks.
Similar to how more data leads to more generalization, using a part of the model for different
tasks constrains that part to learn good values. There are two types of model parameters:
• Task-specific parameters, which benefit only from the examples of their own task.
• Generic parameters, shared across all the tasks, which benefit from the pooled data of all tasks.
Multitask learning leads to better generalization when there is actually some relationship
between the tasks, which actually happens in the context of Deep Learning where some of the
factors, which explain the variation observed in the data, are shared across different tasks.
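A minimal architectural sketch of this sharing, assuming a hypothetical Keras model (the layer sizes, task names and losses are made up for illustration; the shared trunk holds the generic parameters and each head holds the task-specific ones):

```python
from tensorflow import keras

inputs = keras.Input(shape=(64,))
# Shared trunk: generic parameters used by every task.
shared = keras.layers.Dense(128, activation="relu")(inputs)
shared = keras.layers.Dense(64, activation="relu")(shared)
# Task-specific heads: one classification head, one regression head.
task_a = keras.layers.Dense(10, activation="softmax", name="task_a")(shared)
task_b = keras.layers.Dense(1, name="task_b")(shared)

model = keras.Model(inputs, [task_a, task_b])
model.compile(optimizer="adam",
              loss={"task_a": "sparse_categorical_crossentropy",
                    "task_b": "mse"})
```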
8. Early Stopping
As mentioned at the start of the post, after a certain point of time during training, for a model
with extremely high representational capacity, the training error continues to decrease but the
validation error begins to increase (which we referred to as overfitting). In such a scenario, a
better idea is to return to the point where the validation error was lowest. Thus, we keep
evaluating the validation metric after each epoch, and whenever it improves, we store that
parameter setting. Upon termination of training, we return the last saved (i.e. best) parameters.
The idea of Early Stopping is that if the validation error doesn’t improve over a certain fixed
number of iterations, we terminate the algorithm. This effectively reduces the capacity of the
model by reducing the number of steps required to fit the model. The evaluation on the validation
set can be done in parallel on a separate GPU, or after each epoch. A drawback of weight
decay is that we have to manually tweak the weight decay coefficient which, if chosen
wrongly, can lead the model to poor solutions by squashing the weight values too much. In Early
Stopping, no such coefficient needs to be tweaked, which reduces the number of hyperparameters
that we need to tune.
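A minimal sketch of such a patience-based early-stopping loop (`model`, `train_one_epoch` and `validate` are hypothetical placeholders supplied by the user):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=200, patience=10):
    # train_one_epoch(model) runs one epoch of training in place;
    # validate(model) returns the current validation error.
    best_val = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_err = validate(model)
        if val_err < best_val:                 # improvement: remember this parameter setting
            best_val = val_err
            best_model = copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # no improvement for `patience` epochs: stop
                break
    return best_model, best_val
```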
However, since we are setting aside some part of the training data for validation, we are not
using the complete training set. So, once Early Stopping is done, a second phase of training can
be done where the complete training set is used. There are two choices here:
• Train from scratch for the same number of steps as in the Early Stopping case.
• Use the weights learned from the first phase of training and retrain using the complete data.
Besides lowering the number of training steps (and hence the computational cost), Early Stopping
regularizes the model without requiring any additional penalty terms. It affects the optimization
procedure by restricting it to a small volume of the parameter space in the neighbourhood of the
initial parameters. Suppose 𝛕 and ϵ represent the number of iterations and the learning rate
respectively. Then ϵ𝛕 effectively represents the capacity of the model, and intuitively it can be
seen as the inverse of the weight decay coefficient λ: when ϵ𝛕 is small (or λ is large), the
reachable parameter space is small, and vice versa. This equivalence holds for a linear model with
a quadratic cost function and initial parameters w⁰ = 0; the argument takes the Taylor series
approximation of J(w) around the empirically optimal weights w*, follows the gradient descent
updates, and then multiplies by Qᵀ on both sides, using the fact that QᵀQ = I (Q orthogonal), as
sketched below.
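A sketch of the omitted algebra, following the standard quadratic-approximation argument (H is the Hessian of J at w*, with eigendecomposition H = QΛQᵀ):
Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)
w⁽ᵗ⁾ = w⁽ᵗ⁻¹⁾ − ϵ H (w⁽ᵗ⁻¹⁾ − w*)
Qᵀ(w⁽ᵗ⁾ − w*) = (I − ϵΛ) Qᵀ(w⁽ᵗ⁻¹⁾ − w*)
With w⁽⁰⁾ = 0 this gives Qᵀw⁽𝛕⁾ = [I − (I − ϵΛ)^𝛕] Qᵀw*, which matches the L² solution Qᵀw̃ = (Λ + αI)⁻¹Λ Qᵀw* when (I − ϵΛ)^𝛕 ≈ α(Λ + αI)⁻¹, i.e. roughly when 𝛕 ≈ 1/(ϵα).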
9. Parameter Tying and Parameter Sharing
Till now, most of the methods have focused on bringing the weights to a fixed point, e.g. 0 in the case
of norm penalty. However, there might be situations where we might have some prior knowledge
on the kind of dependencies that the model should encode. Suppose, two models A and B,
perform a classification task on similar input and output distributions. In such a case, we’d
expect the parameters for both the models to be similar to each other as well. We could impose a
norm penalty on the distance between the weights, but a more popular method is to force the set
of parameters to be equal. This is the essence behind Parameter Sharing. A major benefit here is
that we need to store only a subset of the parameters (e.g. storing only the parameters for model
A instead of storing for both A and B) which leads to large memory savings. In the example of
Convolutional Neural Networks or CNNs (discussed in Chapter 9), the same feature is computed
across different regions of the image and hence, a cat is detected irrespective of whether it is at
position i or i + 1.
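For the tying (rather than sharing) case, the norm penalty on the distance between the two weight sets mentioned above is usually a simple squared distance:
Ω(w⁽ᴬ⁾, w⁽ᴮ⁾) = ‖w⁽ᴬ⁾ − w⁽ᴮ⁾‖²₂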
10. Sparse Representations
Instead of penalizing the weights, we can place the penalty on the activations of the units (which
indirectly penalizes the weights), encouraging the representation h to be sparse. Another idea is
to average the activation values across various examples and push that average towards some
target value. An example of obtaining representational sparsity by imposing a hard constraint
on the activation values is the Orthogonal Matching Pursuit (OMP) algorithm, where a
representation h is learned for the input x by solving the constrained optimization problem
sketched below:
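The OMP-k problem, reconstructed in its usual form (‖h‖₀ counts the non-zero entries of h, and W is the decoding/dictionary matrix):
h* = arg min_h ‖x − Wh‖²  subject to  ‖h‖₀ < k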
11. Bagging and Other Ensemble Methods
The techniques which train multiple models and combine their outputs (e.g. by voting or
averaging) for the final prediction are called ensemble methods. The idea is that it’s highly
unlikely that multiple models will make the same errors on the test set.
Suppose we have K regression models, with model i making an error ϵᵢ on each example, where
the ϵᵢ are drawn from a zero-mean multivariate normal distribution with variances 𝔼(ϵᵢ²) = v and
covariances 𝔼(ϵᵢϵⱼ) = c. The error the ensemble makes on each example is then the average across
all the models: (∑ᵢ ϵᵢ)/K.
The mean of this average error is 0 (as the mean of each individual ϵᵢ is 0). The variance
of the average error is given by:
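Expanding the square and using the assumed moments gives (a standard two-line computation):
Var[(1/K) ∑ᵢ ϵᵢ] = (1/K²) 𝔼[∑ᵢ ϵᵢ² + ∑_{i≠j} ϵᵢϵⱼ] = v/K + ((K − 1)/K)·c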
Thus, if c = v, then there is no change. If c = 0, then the variance of the average error decreases
with K. There are various ensembling techniques. In the case of Bagging (Bootstrap
Aggregating), the same training algorithm is used multiple times. The dataset is broken into K
parts by sampling with replacement (a minimal sketch follows below) and a model is trained on each
of those K parts. Because of sampling with replacement, the K parts have a few similarities as
well as a few differences. These differences cause the difference in the predictions of the K
models. Model averaging is a very strong technique.
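A toy bagging loop matching the description above (`train_fn` and `predict_fn` are hypothetical callables supplied by the user; X, y and X_test are assumed to be NumPy arrays):

```python
import numpy as np

def bagged_predict(train_fn, predict_fn, X, y, X_test, K=5, seed=0):
    """Train K models on K bootstrap resamples and average their predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        model = train_fn(X[idx], y[idx])             # fit one model on this resample
        preds.append(predict_fn(model, X_test))
    return np.mean(preds, axis=0)                    # model averaging
```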
12. Dropout
In bagging, the models are independent of each other, whereas in dropout, the different models
share parameters, with each sub-model inheriting a different subset of the full network’s
parameters (determined by which units are dropped).
In bagging, each model is trained till convergence, but in dropout, each sub-model is typically
trained for just one step, and the parameter sharing makes sure that the subsequent updates still
lead to better predictions in the future.
At test time, we combine the predictions of all the models. In the case of bagging with K models,
this was given by the arithmetic mean. In case of dropout, the probability that a model is chosen
is given by p(μ), with μ denoting the mask vector. The prediction then becomes ∑ p(μ)p(y|x, μ).
This is not computationally feasible, and there’s a better method to compute this in one go, using
the geometric mean instead of the arithmetic mean.
We need to take care of two main things when working with the geometric mean: none of the sub-models should assign zero probability to any event, and the resulting distribution must be re-normalized so that it sums to one.
The advantage for dropout is that this geometric-mean prediction can be approximated in one pass
of the complete model by rescaling with the keep probability: multiply each unit’s outgoing weights
by its keep probability at test time, or equivalently divide the kept activations by the keep
probability during training (the weight scaling inference rule, sketched below). The motivation
behind this is to capture the right expected values from the output of each unit, i.e. the total
expected input to a unit at train time is equal to the total expected input at test time. A big
advantage of dropout, then, is that it doesn’t place any restriction on the type of model or training
procedure used.
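A minimal sketch of the train-time half of this rule, i.e. "inverted" dropout on a layer's activations (plain NumPy; the function name and keep probability are arbitrary):

```python
import numpy as np

def dropout_forward(h, keep_prob=0.8, training=True):
    """Inverted dropout: keep each unit with probability keep_prob and rescale.

    Dividing the surviving activations by keep_prob keeps the expected input
    to the next layer the same at train and test time; at test time the
    activations simply pass through unchanged.
    """
    if not training:
        return h
    mask = (np.random.rand(*h.shape) < keep_prob).astype(h.dtype)
    return h * mask / keep_prob
```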
Points to note:
• Dropout reduces the representational capacity of the model, so the model should be large enough to begin with.
• It works better with more data.
• For linear regression, it is equivalent to L² regularization with a different weight decay coefficient for each input feature.
Biological Interpretation:
During sexual reproduction, genes are swapped between organisms, so a gene cannot rely on
always being paired with the same set of partner genes and must learn to do something useful in
many different contexts. Similarly, the units in dropout learn to perform well regardless of the
presence of other hidden units, and also in many different contexts.
Adding noise in the hidden layers is more effective than adding noise in the input layer. For
example, let’s assume that some unit learns to detect a nose in a face recognition task. Now, if this
unit is dropped, then some other unit either learns to redundantly detect a nose or associates some
other feature (like the mouth) with recognising a face. Either way, the model learns to make more
use of the information in the input. On the other hand, adding noise to the input won’t completely
remove the nose information, unless the noise is so large as to destroy most of the information
in the input.
13. Adversarial Training
Deep Learning has outperformed humans in the task of Image Recognition, which might lead us
to believe that these models have acquired a human-level understanding of an image. However,
experimentally searching for an x′ (given an x), such that prediction made by the model changes,
shows otherwise. As shown in the image below, although the newly formed image (adversarial
image) looks almost exactly the same to a human, the model classifies it wrongly and that too
with very high confidence:
Adversarial training refers to training on images which are adversarially generated and it has
been shown to reduce the error rate. The main factor attributed to the above mentioned behaviour
is the linearity of the model (say y = Wx), caused by the main building blocks being primarily
linear. Thus, a small change of ϵ in the input causes a drastic change of Wϵ in the output. The
idea of adversarial training is to avoid this jumping and induce the model to be locally constant
in the neighborhood of the training data.
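The notes do not spell out how such an x′ is produced; one widely used construction (the fast gradient sign method, added here for concreteness) exploits exactly the local linearity described above:
x′ = x + ϵ · sign(∇ₓ J(θ, x, y))
Here J is the training loss and ϵ is a small perturbation budget (unrelated to the learning rate ϵ used elsewhere): the change to each input component is tiny, yet a nearly linear model can change its output drastically.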
This can also be used in semi-supervised learning. For an unlabelled sample x, we can assign the
label ŷ(x) using our model. Then, we find an adversarial example, x′, such that the classifier’s
output changes, i.e. ŷ(x′) ≠ ŷ(x) (an adversarial example found this way is called a virtual
adversarial example). The objective then is to assign the same class to both x and x′. The idea
behind this is that different classes are assumed to lie on disconnected manifolds, and a little push
from one manifold shouldn’t land on any other manifold.
14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier
Many ML models assume the data to lie on a low-dimensional manifold to overcome the curse of
dimensionality. The inherent assumption which follows is that small perturbations that cause the
data to move along the manifold it originally belonged to shouldn’t lead to different class
predictions. The idea of the tangent distance algorithm is to find the k-nearest neighbours using, as
the distance metric, the distance between the manifolds the points lie on. A manifold Mᵢ is
approximated by the tangent plane at xᵢ; hence, this technique needs the tangent vectors to be
specified.
The tangent prop algorithm trains a neural-network-based classifier, f(x), to be invariant to known
transformations that cause the input to move along its manifold. Local invariance requires that
∇ₓ f(x) is perpendicular to the known manifold tangent vectors v⁽ⁱ⁾ at x. This can be achieved by
adding a penalty term that minimizes the directional derivative of f(x) along each of the v⁽ⁱ⁾, as
sketched below.
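A sketch of the corresponding regularizer in its standard form (with v⁽ⁱ⁾ the given tangent vectors at x):
Ω(f) = ∑ᵢ ((∇ₓ f(x))ᵀ v⁽ⁱ⁾)²
Driving this penalty towards zero makes f locally invariant to movements along the specified tangent directions.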
It is similar to data augmentation in that both of them use prior knowledge of the domain to
specify various transformations that the model should be invariant to. However, tangent prop
only resists infinitesimal perturbations while data augmentation causes invariance to much larger
perturbations.
UNIT – V: Optimization for Train Deep Models Challenges in Neural Network Optimization,
Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates,
Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms Applications:
Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing
1. CHALLENGES IN NEURAL NETWORK OPTIMIZATION
The optimization problem for training neural networks is generally non-convex. Some of
the challenges faced are mentioned below:
• Ill-conditioning of the Hessian Matrix: The Hessian matrix and condition number have
been covered in our summary for Chapter 4. For the sake of completeness, the Hessian
matrix H of a function f with a vector-valued input x is given by H(f)(x)ᵢ,ⱼ = ∂²f(x)/∂xᵢ∂xⱼ;
ill-conditioning means that the ratio of its largest to smallest eigenvalue is very large, so a
step size small enough for the high-curvature directions makes painfully slow progress in
the low-curvature ones.
• Local minima: Nearly any Deep Learning (DL ) model is guaranteed to have an extremely
large number of local minima (LM) arising due to the model identifiability problem.
• Plateaus, Saddle Points and Other Flat Regions: Saddle point (SP) is another type of
point with zero gradient where some points around it have higher value and the others have
lower. Intuitively, this means that a saddle point acts as both a local minima for some
neighbors and a local maxima for the others. Thus, the Hessian at an SP has both positive and
negative eigenvalues (in short, for a function to curve purely upwards or downwards around a
point, as in the case of a local minimum or maximum, the eigenvalues should all have the same
sign: positive for a local minimum and negative for a local maximum).
• Cliffs and Exploding Gradients: Neural Networks (NNs) might sometimes have
extremely steep regions resembling cliffs due to the repeated multiplication of weights.
Suppose we use a 3-layer (input-hidden-output) neural network with all the activation
functions as linear. We choose the same number of input, hidden and output neurons, thus,
using the same weight W for each layer. The output layer y = W*h where h =
W*x represents the hidden layer, finally giving y = W*W x. So, deep neural networks
involve multiplication of a large number of parameters leading to sharp non-linearities in
the parameter space. These non-linearities give rise to high gradients in some places. At the
edge of such a cliff, an update step might throw the parameters extremely far.
Image depicting the problem of exploding gradients when approaching a cliff. 1) Usual
training going on with the parameters moving towards the lower cost region. 2) The gradient at
the bottom left-most point pointed downwards (correct direction) but the step-size was too large,
which caused the parameters to land at a point having large cost value. 3) The gradient at this new
point moved the parameters in a completely different position undoing most of the training done
until that point.
Thus, any eigenvalues whose absolute value is not near one either explode or vanish when
compounded, leading to the Vanishing and Exploding Gradient problem. Reusing the same weight
matrix is especially common in Recurrent NNs (RNNs), where this is a serious problem.
Values even slightly above or below 1 explode or vanish upon being repeatedly compounded
(compare 1.01³⁶⁵ ≈ 37.8 with 0.99³⁶⁵ ≈ 0.03; you may have seen that comparison as a poster in a
separate context).
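A tiny numerical illustration of this compounding effect (hypothetical diagonal weight matrices, chosen only so the eigenvalues are obvious):

```python
import numpy as np

# Repeatedly applying the same weight matrix compounds its eigenvalues,
# so values that drift away from 1 either explode or vanish.
W_explode = np.diag([1.1, 1.05])   # eigenvalues slightly above 1
W_vanish  = np.diag([0.9, 0.95])   # eigenvalues slightly below 1
x = np.ones(2)

for name, W in [("explode", W_explode), ("vanish", W_vanish)]:
    h = x.copy()
    for _ in range(100):            # 100 "layers" / time steps
        h = W @ h
    print(name, h)  # explode -> huge values, vanish -> values near zero
```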
2. BASIC ALGORITHMS
• Stochastic Gradient Descent: This has already been described before, but there are certain
things to keep in mind regarding SGD. The learning rate ϵ is a very important parameter for
SGD and should, in general, be reduced over the course of training. This is because the
random sampling of mini-batches acts as a source of noise, which can make SGD keep
oscillating around the minimum without ever settling on it.
• Momentum: the step size (earlier equal to learning rate × gradient) now depends on
how large and how aligned the sequence of recent gradients is (the update rule is sketched after
the Nesterov Momentum bullet below). If the gradients at each iteration point in the same
direction (say g), the contributions keep accumulating and the step size grows.
Once it reaches a constant (terminal) velocity, the step size becomes ϵ||g|| / (1 − α). Thus, using
α = 0.9 makes the terminal step roughly 10 times the plain gradient step. Common values of α are
0.5, 0.9 and 0.99.
Viewing it as the Newtonian dynamics of a particle sliding down a hill, the momentum
algorithm consists of solving a set of differential equations via numerical simulation. There are two
kinds of forces involved as shown below:
Momentum can be seen as two forces operating together. 1) Proportional to the negative
of the gradient such that whenever it descends a steep part of the surface, it gathers speed and
continues sliding in that direction until it goes uphill again. 2) A viscous drag force (friction)
proportional to -v(t) without the presence of which the particle would keep oscillating back and
forth as the negative of the gradient would keep forcing it to move downhill. Viscous force is
suitable as it is weak enough to allow the gradient to cause motion and strong enough to resist any
motion if the gradient doesn’t justify moving.
Read more about momentum in the excellent Distill (distill.pub) article: Why Momentum
Really Works.
• Nesterov Momentum: This is a slight modification of the usual momentum equation. Here,
the gradient is calculated after applying the current velocity to the parameters, which can be
viewed as adding a correction factor:
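A sketch of the two update rules referenced above, in their standard form (velocity v, momentum coefficient α, learning rate ϵ):
Momentum: v ← αv − ϵ ∇_θ J(θ),  θ ← θ + v
Nesterov Momentum: v ← αv − ϵ ∇_θ J(θ + αv),  θ ← θ + v
The only difference is that Nesterov evaluates the gradient after the current velocity has been provisionally applied, which acts as the correction factor mentioned above.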
3. PARAMETER INITIALIZATION STRATEGIES
Training algorithms for deep learning models are iterative in nature and require the
specification of an initial point. This is crucial, as the initial point often decides whether the
algorithm converges at all and, if it does, whether it converges to a point with high or low cost.
We have limited understanding of neural network optimization but the one property that we
know with complete certainty is that the initialization should break symmetry. This means that if
two hidden units are connected to the same input units, then these should have different initialization
or else the gradient would update both the units in the same way and we don’t learn anything new
by using an additional unit. The idea of having each unit learn something different motivates
random initialization of weights which is also computationally cheaper.
Biases are often chosen heuristically (zero mostly) and only the weights are randomly
initialized, almost always from a Gaussian or uniform distribution. The scale of the distribution is
of utmost concern. Large weights might have better symmetry-breaking effect but might lead to
chaos (extreme sensitivity to small perturbations in the input) and exploding values during forward
& back propagation. As an example of how large weights might lead to chaos, consider that there’s
a slight noise ϵ is added to the input. Now, if we did just a simple linear transformation like W * x,
the ϵ noise would add a factor of W * ϵ to the output. In case the weights are large, this ends up
making a significant contribution to the output. SGD and its variants tend to halt in areas near the
initial values, thereby expressing a prior that the path to the final parameters from the initial values
is discoverable by steepest descent algorithms. A more mathematical explanation for the symmetry
breaking can be found in the Appendix.
Various suggestions have been made for appropriate initialization of the parameters. The
most commonly used ones include sampling the weights of each fully-connected layer
having m inputs and n outputs uniformly from the following distributions:
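The two distributions usually quoted here, reconstructed in their standard form (with m the number of inputs and n the number of outputs of the layer):
Wᵢ,ⱼ ~ U(−1/√m, 1/√m)
Wᵢ,ⱼ ~ U(−√(6/(m+n)), √(6/(m+n)))   (the Glorot / Xavier uniform initialization)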
U(a, b) represents the uniform distribution whose probability density is 1/(b − a) for every value
between a and b (inclusive) and 0 everywhere else.
These initializations have already been incorporated into the most commonly used Deep
Learning frameworks nowadays so that you can just specify which initializer to use and the
framework takes care of sampling appropriately. For example, Keras, a very popular deep
learning framework, has a module called initializers, where the second distribution (among the 2
mentioned above) is implemented as glorot_uniform.
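A minimal sketch of how this looks in practice (standard Keras API; the layer size is arbitrary):

```python
from tensorflow import keras

# Glorot (Xavier) uniform initialization for the kernel; biases start at zero.
layer = keras.layers.Dense(64, activation="relu",
                           kernel_initializer="glorot_uniform",
                           bias_initializer="zeros")
```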
One drawback of using 1 / √m as the standard deviation is that the weights end up being
small when a layer has too many input/output units. Motivated by the idea to have the total amount
of input to each unit independent of the number of input units m, Sparse initialization sets each
unit to have exactly k non-zero weights. However, it takes a long time for GD to correct incorrect
large values and hence, this initialization might cause problems.
If the weights are too small, the range of activations across the mini-batch will shrink as the
activations propagate forward through the network. By repeatedly identifying the first layer with
unacceptably small activations and increasing its weights, it is possible to eventually obtain a
network with reasonable initial activations throughout.
The biases are relatively easier to choose. Setting the biases to zero is compatible with most
weight initialization schemes, except in a few cases: for example, for an output unit, to prevent
saturation at initialization, or when a unit acts as a gate for making a decision. Refer to the chapter
for details.
4. ALGORITHMS WITH ADAPTIVE LEARNING RATES
• AdaGrad: Rather than keeping the learning rate fixed throughout training (or decaying it
only after one or several epochs), a better approach is to adapt the learning rate as the training progresses.
This can be done by scaling the learning rates of each model parameter individually
inversely proportional to the square root of the sum of historical squared values of the
gradient. In the parameter update (sketched below), r is initialized with 0 and the
multiplication in the update step happens element-wise. Since the gradient value is different for
each parameter, the learning rate is scaled differently for each parameter too. Parameters with a
large accumulated gradient get a large decrease in their learning rate: a learning rate that stays too
high can cause oscillations, or make the parameter jump over a nearby minimum (as illustrated in
the figure description below), so it should be decreased for better convergence. Parameters with
small gradients get only a small decrease: they may already be close to their respective minima
and should not be pushed away, and even if they are not, reducing their learning rate too much
would slow their learning even further.
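A sketch of the AdaGrad update referred to above, in its standard form (g is the current gradient, δ a small constant for numerical stability, ⊙ element-wise multiplication):
r ← r + g ⊙ g
Δθ = −(ϵ / (δ + √r)) ⊙ g
θ ← θ + Δθ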
This figure illustrates the need to reduce the learning rate if gradient is large in case of a
single parameter. 1) One step of gradient descent representing a large gradient value. 2) Result of
reducing the learning rate — moves towards the minima 3) Scenario if the learning rate was not
reduced — it would have jumped over the minima.
However, accumulation of squared gradients from the very beginning can lead to an excessive
and premature decrease in the learning rate. Consider a model with only 2 parameters (for
simplicity), both with initial gradients of 1000. After some iterations, the gradient of one of the
parameters has reduced to 100 but that of the other parameter is still around 750. However,
because of the accumulation at each update, the accumulated gradients would still have almost the
same value. For example, let the accumulated gradients for Parameter 1 be 1000 + 900 +
700 + 400 + 100 = 3100, giving 1/3100 ≈ 0.0003, and for Parameter 2 be 1000 + 900 + 850 + 800 +
750 = 4300, giving 1/4300 ≈ 0.0002. This leads to a similar decrease in the learning rates of both
parameters, even though the parameter with the lower gradient may have its learning rate
reduced far too much, leading to slower learning.
Figure explaining the problem with AdaGrad: accumulated gradients can cause the
learning rate to be reduced far too much in the later stages, leading to slower learning.
• RMSProp: AdaGrad is modified so that the accumulated quantity is an exponentially weighted
moving average of the squared gradients rather than their full sum (update sketched below). ρ is the
weighting used for the exponential averaging; as more updates are made, the contribution of past
gradient values shrinks, since ρ < 1 and ρ > ρ² > ρ³ …
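A sketch of the RMSProp update in its standard form (same notation as for AdaGrad above):
r ← ρr + (1 − ρ) g ⊙ g
Δθ = −(ϵ / √(δ + r)) ⊙ g
θ ← θ + Δθ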
This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an
instance of AdaGrad initialized within that bowl. Let me explain why this is so. Consider the figure
below. The region represented by 1 indicates usual RMSProp parameter updates as given by the
update equation, which is nothing but exponentially averaged AdaGrad updates. Once the
optimization process lands on A, it essentially lands at the top of a convex bowl. At this point,
intuitively, all the updates before A can be seen to be forgotten due to the exponential averaging
and it can be seen as if (exponentially averaged) AdaGrad updates start from point A onwards.
Intuition behind RMSProp. 1) Usual parameter updates 2) Once it reaches the convex bowl,
exponentially weighted averaging would cause the effect of earlier gradients to reduce and to
simplify, we can assume their contribution to be zero. This can be seen as if AdaGrad had been
used with the training initiated inside the convex bowl
• Adam: Adapted from “adaptive moments”, it focuses on combining RMSProp and
Momentum. Firstly, it views Momentum as an estimate of the first-order moment and
RMSProp as that of the second moment. The weight update for Adam is given by:
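Reconstructed here in its standard form (g is the mini-batch gradient, t the time step, δ a small stabilising constant):
s ← ρ₁ s + (1 − ρ₁) g          (first-moment / momentum estimate)
r ← ρ₂ r + (1 − ρ₂) g ⊙ g      (second-moment / RMSProp-style estimate)
ŝ = s / (1 − ρ₁ᵗ),  r̂ = r / (1 − ρ₂ᵗ)   (bias correction)
θ ← θ − ϵ ŝ / (√r̂ + δ)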
Secondly, since s and r are initialized as zeros, the authors observed a bias during the initial
steps of training thereby adding a correction term for both the moments to account for their
initialization near the origin. As an example of what the effect of this bias correction is, we’ll look
at the values of s and r for a single parameter (in which case everything is now represented as a
scalar). Let’s first understand what would happen if there was no bias correction. Since s (notice
that this is not in bold as we are looking at the value for a single parameter and the s here is a
scalar) is initialized as zero, after the first iteration, the value of s would be (1 — ρ1) * g and that
of r would be (1 — ρ2) * g². The commonly recommended values for ρ1 and ρ2 are 0.9 and 0.999 respectively. Thus,
the initial values of s and r are quite small, and this effect gets compounded as the training progresses.
However, if we now use bias correction, after the first iteration, the value of s is just g and that
of r is just g². This gets rid of the bias that occurs in the initial phase of training. A major advantage
of Adam is that it’s fairly robust to the choice of these hyperparameters, i.e. ρ1 and ρ2.
The figure below shows the comparison between the various optimization methods
discussed above. It can be clearly seen that algorithms with adaptive learning rates provide faster
convergence:
NAG here refers to Nesterov Accelerated Gradient which is the same as Nesterov
Momentum.
5. APPROXIMATE SECOND-ORDER METHODS
The optimization algorithms that we’ve looked at till now involved computing only the first
derivative. But there are many methods which involve higher-order derivatives as well. The main
problem with these algorithms is that they are not practically feasible in their vanilla form, so
certain methods are used to approximate the required quantities. We explain three such
methods, all of which use the empirical risk as the objective function:
• Newton’s Method: This is the most common higher-order derivative method used. It makes
use of the curvature of the loss function via its second-order derivative to arrive at the
optimal point. Using the second-order Taylor Series expansion to approximate J(θ) around
a point θo and ignoring derivatives of order greater than 2 (this has already been discussed
in previous chapters), we get:
We know that we get a critical point for any function f(x) by solving for f'(x) = 0. We get
the following critical point of the above equation (refer to the Appendix for proof):
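A sketch of the two expressions referenced above: the second-order Taylor expansion around θ₀ and the resulting Newton update.
J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)
θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)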
For quadratic surfaces (i.e. where cost function is quadratic), this directly gives the optimal
result in one step whereas gradient descent would still need to iterate. However, for surfaces that
are not quadratic, as long as the Hessian remains positive definite, we can obtain the optimal point
through a 2-step iterative process — 1) Get the inverse of the Hessian and 2) update the parameters.
Saddle points are problematic for Newton’s method. If all the eigenvalues are not positive,
Newton’s method might cause the updates to move in the wrong direction. A way to avoid this is
to add regularization:
However, if there is a strong negative curvature i.e. the eigenvalues are largely
negative, α needs to be sufficiently high to offset the negative eigenvalues in which case the Hessian
becomes dominated by the diagonal matrix. This leads to an update which becomes the standard
gradient divided by α:
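The regularized update and its large-α limit, sketched in the standard form:
θ* = θ₀ − (H + αI)⁻¹ ∇_θ J(θ₀)
When α grows large enough to dominate the Hessian, this approaches θ₀ − (1/α) ∇_θ J(θ₀), i.e. ordinary gradient descent with step size 1/α.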
Another problem restricting the use of Newton’s method is the computational cost. It
takes O(k³) time to calculate the inverse of the Hessian where k is the number of parameters. It’s
not uncommon for Deep Neural Networks to have about a million parameters and since the
parameters are updated every iteration, this inverse needs to be calculated at every iteration, which
is not computationally feasible.
• Conjugate Gradients: One weakness of the method of steepest descent (i.e. GD) is that line
searches happen along the direction of the gradient. Suppose the previous search direction
is d(t-1). Once the search terminates (which it does when the gradient along the current
gradient direction vanishes) at the minimum, the next search direction, d(t) is given by the
gradient at that point, which is orthogonal to d(t-1) (because if it’s not orthogonal, it’ll have
some component along d(t-1) which cannot be true as at the minimum, the gradient along
d(t-1) has vanished).
Upon getting the minimum along the current search direction, the minimum along the
previous search direction is not preserved, undoing, in a sense, the progress made in
previous search direction.
In the method of conjugate gradients, we seek a search direction that is conjugate to the
previous line-search direction: d(t) = ∇_θ J(θ) + βt d(t−1), with d(t) and d(t−1) being conjugate if
d(t)ᵀ H d(t−1) = 0. In this way the previous search direction contributes towards finding the next
one, and βt decides how much of d(t−1) is added back to the current search direction. There are
two popular choices for βt — Fletcher-Reeves and
Polak-Ribière. These discussions assumed the cost function to be quadratic where the conjugate
directions ensure that the gradient along the previous direction does not increase in magnitude. To
extend the concept to work for training neural networks, there is one additional change. Since it’s
no longer quadratic, there’s no guarantee anymore that the conjugate direction would preserve the
minimum in the previous search directions. Thus, the algorithm includes occasional resets where
the method of conjugate gradients is restarted with line search along the unaltered gradient.
• BFGS: This algorithm tries to bring the advantages of Newton’s method without the
additional computational burden by approximating the inverse of H by M(t), which is
iteratively refined using low-rank updates. Finally, line search is conducted along the
direction M(t)g(t). However, BFGS requires storing the matrix M(t) which takes O(n²)
memory making it infeasible. An approach called Limited Memory BFGS (L-BFGS) has
been proposed to tackle this infeasibility by computing the matrix M(t) using the same
method as BFGS but assuming that M(t−1) is the identity matrix.
6. OPTIMIZATION STRATEGIES AND META-ALGORITHMS
• Batch Normalization: Batch normalization (BN) is one of the most exciting innovations in
Deep learning that has significantly stabilized the learning process and allowed faster
convergence rates. The intuition behind batch normalization is as follows: Most of the Deep
Learning networks are compositions of many layers (or functions) and the gradient with
respect to one layer is taken considering the other layers to be constant. However, in practice
all the layers are updated simultaneously, and this can lead to unexpected results. For
example, let y* = x W¹ W² … W¹⁰. Here, y* is a linear function of x but not a linear function
of the weights. Suppose the gradient is given by g and we now intend to reduce y* by 0.1.
Using first-order Taylor Series approximation, taking a step of ϵg would reduce y* by ϵg’
g. Thus, ϵ should be 0.1/(g’ g) just using the first-order information. However, higher order
effects also creep in as the updated y* is given by:
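A sketch of the updated output referred to above (gⁱ being the gradient with respect to Wⁱ):
y*_new = x (W¹ − ϵg¹)(W² − ϵg²) … (W¹⁰ − ϵg¹⁰)
Expanding this product produces higher-order terms such as ϵ² g¹ g² W³ … W¹⁰; if the product of the remaining weights is large, these terms can dominate and the update can overshoot badly, even though the step size was chosen using only first-order information.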
Going back to the earlier example of y*, let the activations of layer l be given by h(l-1).
Then h(l-1) = x W1 W2 … W (l-1). Now, if x is drawn from a unit Gaussian, then h(l-1) also comes
from a Gaussian, however, not of zero mean and unit variance, as it is a linear transformation of x.
BN makes it zero mean and unit variance. Therefore, y* = Wl h(l-1) and thus, the learning now
becomes much simpler as the parameters at the lower layers mostly do not have any effect. This
simplicity was definitely achieved by rendering the lower layers useless. However, in a realistic
deep network with non-linearities, the lower layers remain useful. Finally, the complete
reparameterization of BN replaces H with γH′ + β, where H′ is the normalized (zero-mean,
unit-variance) version of H. This is done to retain the expressive power of the layer, with the mean
and variance of the new activations now determined by the learnable parameters β and γ rather
than by complicated interactions between layers. Also, among the choices of normalizing X or
XW + B, the authors recommend the latter, specifically XW, since B becomes redundant because
of β. Practically, this means that when we are using a Batch Normalization layer, the bias of the
preceding layer should be turned off. In a deep learning framework like Keras, this can be done by
setting the parameter use_bias=False in the Convolutional layer, as sketched below.
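A minimal sketch using the standard Keras functional API (the input shape and layer sizes are arbitrary):

```python
from tensorflow import keras

inputs = keras.Input(shape=(32, 32, 3))
# Bias turned off: the BatchNormalization shift parameter beta makes it redundant.
x = keras.layers.Conv2D(32, 3, padding="same", use_bias=False)(inputs)
x = keras.layers.BatchNormalization()(x)
outputs = keras.layers.Activation("relu")(x)
model = keras.Model(inputs, outputs)
```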
• Coordinate Descent: Generally, a single weight update is made by taking the gradient with
respect to every parameter. However, in cases where some of the parameters might be
independent (discussed below) of the remaining, it might be more efficient to take the
gradient with respect to those independent sets of parameters separately for making updates.
Let me clarify that with an example. Suppose we have the following cost function:
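The cost function in question, reconstructed in its usual two-term form (H is the sparse representation and W the weights used to linearly decode H back into X):
J(H, W) = ∑ᵢ,ⱼ |Hᵢ,ⱼ| + ∑ᵢ,ⱼ (X − WᵀH)²ᵢ,ⱼ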
This cost function describes the learning problem called sparse coding. Here, H refers to the
sparse representation of X and W is the set of weights used to linearly decode H to retrieve X. An
explanation of why this cost function enforces the learning of a sparse representation of X follows.
The first term of the cost function penalizes values far from 0 (positive or negative, because of the
modulus operator |H|). This pushes most of the values towards 0, thereby enforcing sparsity. The
second term is fairly self-explanatory in that it penalizes the difference between X and the
reconstruction obtained by linearly transforming H with W, forcing them to take similar values.
In this way, H is learned as a sparse “representation” of X. The cost function generally also includes
a regularization term like weight decay, which has been avoided for simplicity. Here, we can divide
the entire list of parameters into two sets, W and H. Minimizing the cost function with respect to
any of these sets of parameters is a convex problem. Coordinate Descent (CD) refers to
minimizing the cost function with respect to only one parameter at a time. It has been shown that by
repeatedly cycling through all the parameters, we are guaranteed to arrive at a local minimum. If
instead of 1 parameter, we take a set of parameters as we did before with W and H, it is called block
coordinate descent (the interested reader should explore Alternating Minimization). CD makes
sense if either the parameters are clearly separable into independent groups or if optimizing with
respect to certain set of parameters is more efficient than with respect to others.
(Figure: the points A, B, C and D indicate the locations in the parameter space where coordinate
descent landed after each gradient step.)
Coordinate descent may fail terribly when one variable influences the optimal value of
another variable.
• Polyak Averaging: Polyak averaging consists of averaging several points in the parameter
space that the optimization algorithm traverses through. So, if the algorithm encounters the
points θ(1), θ(2), … during optimization, the output of Polyak averaging is:
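The average in question, in its plain form (after t points have been visited):
θ̂⁽ᵗ⁾ = (1/t) ∑ᵢ θ⁽ⁱ⁾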
The optimization algorithm might oscillate back and forth across a valley without ever
reaching the minima. However, the average of those points should be closer to the bottom of the
valley.
Most optimization problems in deep learning are non-convex where the path taken by the
optimization algorithm is quite complicated and it might happen that a point visited in the distant
past might be quite far from the current point in the parameter space. Thus, including such a point
in the distant past might not be useful, which is why an exponentially decaying running average is
taken. This scheme where the recent iterates are weighted more than the past ones is called Polyak-
Ruppert Averaging:
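A sketch of the exponentially decaying running average (with decay constant α close to 1):
θ̂⁽ᵗ⁾ = α θ̂⁽ᵗ⁻¹⁾ + (1 − α) θ⁽ᵗ⁾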
• Supervised Pre-training: Sometimes it’s hard to directly train to solve for a specific task.
Instead it might be better to train for solving a simpler task and use that as an initialization
point for training to solve the more challenging task. As an intuition for why this seems
logical, consider that you didn’t have any background in integration and were asked to compute
a complicated definite integral right away.
If you’re anything close to a normal person, your first reaction would be sheer confusion. However,
wouldn’t it be better if you were first asked to understand the basic integrals?
I hope you see what I meant with this example — learning a simpler task would put
you in a better position to understand the more complex task. This particular strategy of training to
solve a simpler task before facing the herculean one is called pretraining. A particular type of
pretraining, called greedy supervised pretraining, first breaks a given supervised learning
problem into simpler supervised learning problems and solves for the optimal version of each
component in isolation. To build on the above intuition, the hypothesis as to why this works is that
it gives better guidance to the intermediate layers of the network and helps in both, generalization
and optimization. More often than not, the greedy pretraining is followed by a fine-tuning stage
where all the parts are jointly optimized to search for the optimal solution to the full problem. As
an example, the figure below shows how each hidden layer is trained one at a time, where the input
to the hidden layer being learned is the output of the previously trained hidden layer.
Greedy supervised pretraining (a) The first hidden layer is being trained only using the
original inputs and outputs. (b) For training the second hidden layer, the hidden-output connection
from the first hidden layer is removed and the output of the first hidden layer is used as the input.
Also, FitNets shows an alternative way to guide the training process. Deep networks are
hard to train mainly because the deeper the model gets, the more non-linearities are introduced. The
authors propose the use of a shallower and wider teacher network that is trained first. Then, a
second network which is thinner and deeper, called the student network is trained to predict not
only the final outputs but also the intermediate layers of the teacher network. For those who might
not be clear about what deep, shallow, wide and thin mean in the context of neural networks: a
deeper network has more layers and a shallower one fewer, while a wider network has more units
per layer and a thinner one fewer.
The idea is that predicting the intermediate layers of the teacher network provides some
hints as to how the layers of the student network should be used and aids the optimization procedure.
It was shown that without the hints to the hidden layers, the student network performs poorly on
both the training and the test data.
• Designing Models to Aid Optimization: Most of the work in deep learning has been
towards making the models easier to optimize rather than designing a more powerful
optimization algorithm. Modern networks favour building blocks that behave almost linearly
(linear transformations between layers, and activation functions with significant slope almost
everywhere). Linear functions consistently increase in a particular direction. Thus, if
there’s an error, there’s a clear direction towards which the output should move to minimize
the error.
APPLICATIONS
1. LARGE-SCALE DEEP LEARNING
• Philosophy of connectionism
– While an individual neuron/feature is not intelligent, a large no. acting together
can exhibit intelligent behavior
– No. of neurons must be large
• Although network sizes have increased exponentially in three decades, ANNs are only as
large as nervous systems of insects
• Since size is important, DL requires high-performance hardware and software infrastructure
2. COMPUTER VISION
• Computer Vision is one of the most active areas for deep learning research, since
– Vision is a task effortless for humans but difficult for computers
• Standard benchmarks for deep learning algorithms are:
– object recognition
– OCR
• Computer vision requires little preprocessing
– Pixel range
• Images should be standardized so pixels lie in the same range, e.g. [0, 1], [-1, 1], or [0, 255] (a minimal rescaling sketch follows this list)
– Picture size
• Some architectures need a standard size. So images may need to be scaled
• May not be needed with convolutional models which dynamically adjust size of pooling
regions
– Data set augmentation
• Can be seen as a preprocessing step for training set
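A minimal sketch of the pixel-range standardization mentioned in the list above (dummy data, assuming 8-bit images):

```python
import numpy as np

# Dummy batch of 8-bit RGB images.
images = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
images_01 = images.astype(np.float32) / 255.0   # rescaled to [0, 1]
images_pm1 = images_01 * 2.0 - 1.0              # rescaled to [-1, 1]
```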
3. SPEECH RECOGNITION
Large-scale automatic speech recognition is the first and most convincing success story
of deep learning. LSTM RNNs can learn "Very Deep Learning" tasks that involve multi-second
intervals containing speech events separated by thousands of discrete time steps, where one time
step corresponds to about 10 ms. LSTM with forget gates is competitive with traditional speech
recognizers on certain tasks.
The initial success in speech recognition was based on small-scale recognition tasks based
on TIMIT. The data set contains 630 speakers from eight major dialects of American English,
where each speaker reads 10 sentences. Its small size lets many configurations be tried. More
importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence
recognition, allows weak phone bigram language models. This lets the strength of the acoustic
modeling aspects of speech recognition be more easily analyzed. Error rates on this task, including
these early results, are measured as percent phone error rate (PER) and have been summarized in
the literature since 1991.
The debut of DNNs for speaker recognition in the late 1990s, for speech recognition around
2009-2011, and of LSTM around 2003-2007 accelerated progress in eight major areas.
4. NATURAL LANGUAGE PROCESSING
Previously, logistic regression or SVMs were used to build time-consuming, complex models, but now
distributed representations, convolutional neural networks, recurrent and recursive neural networks,
reinforcement learning, and memory-augmenting strategies are helping achieve greater maturity
in NLP. Distributed representations are particularly effective in producing the linear semantic
relationships used to build phrases and sentences, and in capturing local word semantics with word
embeddings (word embedding entails the meaning of a word being defined in the context of its
neighboring words).