

Neural Networks and Deep Learning - Notes

Neural Networks & Deep Learning (Jawaharlal Nehru Technological University, Hyderabad)


NEURAL NETWORKS AND DEEP LEARNING

UNIT – I: Artificial Neural Networks Introduction, Basic models of ANN, important


terminologies, Supervised Learning Networks, Perceptron Networks, Adaptive Linear Neuron,
Back propagation Network. Associative Memory Networks. Training Algorithms for pattern
association, BAM and Hopfield Networks.

UNIT - II: Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets,
Maxnet, Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector
Quantization, Counter Propagation Networks, Adaptive Resonance Theory Networks. Special
Networks Introduction to various networks.

UNIT - III : Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed -
forward networks, Gradient-Based learning, Hidden Units, Architecture Design, Back-
Propagation and Other Differentiation Algorithms.

UNIT – IV: Regularization for Deep Learning - Parameter Norm Penalties, Norm Penalties as
Constrained Optimization, Regularization and Under-Constrained Problems, Dataset
Augmentation, Noise Robustness, Semi-Supervised learning, Multi-task learning, Early
Stopping, Parameter Tying and Parameter Sharing, Sparse Representations, Bagging and
other Ensemble Methods, Dropout, Adversarial Training, Tangent Distance, Tangent Prop and
Manifold Tangent Classifier

UNIT – V: Optimization for Training Deep Models - Challenges in Neural Network Optimization,
Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning
Rates, Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms
Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural
Language Processing


UNIT - I : Artificial Neural Networks Introduction, Basic models of ANN, important


terminologies, Supervised Learning Networks, Perceptron Networks, Adaptive Linear Neuron,
Back propagation Network. Associative Memory Networks. Training Algorithms for pattern
association, BAM and Hopfield Networks.

➢ Artificial Neural Networks:


An artificial neural network (ANN) is a computational model based on the structure
and functions of biological neural networks. ANNs are considered nonlinear
statistical data modeling tools in which the complex relationships between inputs and
outputs are modeled or patterns are found.
The artificial neural network receives the input signal from the external
world in the form of a pattern or image, represented as a vector. Each input is
then multiplied by its corresponding weight (these weights are the information the
artificial neural network uses to solve a given problem).
****
➢ Basic Models of ANN:
The models of ANN are specified by the three basic entities namely:

1. The model's synaptic interconnection.


2. The training rules or learning rules adopted for updating and adjusting the connection
weights.
3. Their activation functions.

1. Connections:-
An ANN consists of a set of highly interconnected processing elements such that each
processing element's output is found to be connected through weights to the other processing
elements or to itself; delay leads and lag-free connections are allowed. Hence, the arrangement
of these processing elements and the geometry of their interconnections are essential for an
ANN. The points where the connections originate and terminate should be noted, and the
function of each processing element in an ANN should be specified.
The arrangement of neurons to form layers and connection pattern formed within and between
layers is called the network architecture.
There are five basic types of neuron connection architectures:-

1. Single layer feed forward network.


2. Multilayer feed forward network
3. Single node with its own feedback
4. Single layer recurrent network
5. Multilayer recurrent network


1. Single layer feed forward network

Eg:- Single Layer feed forward Network


A layer is formed by taking a processing element and combining it with other processing
elements. When a layer of the processing nodes is formed, the inputs can be connected to these
nodes with various weights, resulting in a series of outputs, one per node.
Thus, a single layer feed forward network is formed.
2. Multilayer feed forward network

Fig: Multilayer feed forward Network


A multilayer feed forward network is formed by the interconnection of several layers. The input
layer is that which receives the input and this layer has no function except buffering the input
signal. The output layer generates the output of the network. Any layer that is formed between
the input layer and the output layer is called the hidden layer.

3. Single node with its own feedback

If the feedback of the output of the processing elements is directed back as an input to the
processing elements in the same layer then it is called lateral feedback.

Competitive Net
The competitive interconnections have a fixed weight of −ε. This net is called Maxnet, and we will
study it in the unsupervised learning networks category.

4. Single layer recurrent network

Fig: - Single Layer Recurrent Network


Recurrent networks are the feedback networks with a closed loop.


5. Multilayer recurrent network

6. Lateral inhibition structure

****
➢ Important Terminologies :
The field of artificial neural networks has developed alongside many disciplines, such
as neurobiology, mathematics, statistics, economics, computer science, engineering and
physics, to mention but a few. Consequently, the terminology used in the field varies
from discipline to discipline. We present four of them.
1. Activation Function: Algorithm for computing the activation value of a neurode
as a function of its net input. Net input is typically the sum of weighted inputs to
the neurode.
2. Feed forward Network: Network ordered into layers with no feedback paths. The
lowest layer is the input layer, the highest is the output layer. The outputs of a given
layer go only to higher layers and its inputs come only from lower layers.
3. Supervised Learning: Learning procedure in which a network is presented with a
set of input pattern and target pairs. The network can compare its output to the target
and adapt itself according to the learning rules.


4. Unsupervised Learning: Learning procedure in which the network is presented


with a set of input patterns. The network adapts itself according to the statistical
associations in the input patterns.
****
➢ Supervised Learning Networks :
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Basically, supervised learning is learning in which we teach or train the machine using data
that is well labelled, meaning each example is already tagged with the correct answer. After
that, the machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (the set of training examples) and produces a correct outcome
from the labelled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all different fruits one by one like this:

• If shape of object is rounded and depression at top having color Red then it will be
labelled as –Apple.
• If shape of object is long curving cylinder having color Green-Yellow then it will be
labelled as –Banana.
Now suppose that, after training, you give the machine a new fruit (say a banana) from the
basket and ask it to identify it.
Since the machine has already learned from the previous data, it can now use that knowledge:
it will first classify the fruit by its shape and colour, confirm the fruit name
as BANANA, and put it in the banana category. Thus the machine learns from the training
data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such
as “Red” or “blue” or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
****
➢ Perceptron Network :

Developed by Frank Rosenblatt by using McCulloch and Pitts model, perceptron is the
basic operational unit of artificial neural networks. It employs supervised learning rule and is
able to classify the data into two classes.
Operational characteristics of the perceptron: It consists of a single neuron with an arbitrary
number of inputs along with adjustable weights, but the output of the neuron is 1 or 0
depending upon the threshold. It also has a bias, whose associated input is always 1. The following
figure gives a schematic representation of the perceptron.


Perceptron thus has the following three basic elements −


• Links − It has a set of connection links, each of which carries a weight, including a
bias link whose input is always 1.
• Adder − It adds the inputs after they are multiplied by their respective weights.
• Activation function − It limits the output of the neuron. The most basic activation function
is a Heaviside step function, which has two possible outputs. This function returns 1 if
the input is positive, and 0 for any negative input.

Training Algorithm

Perceptron network can be trained for single output unit as well as multiple output units.

• Training Algorithm for Single Output Unit

Step 1 − Initialize the following to start the training −

• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
xi=si (i=1to n)
Step 5 − Now obtain the net input with the following relation −
Yin = b + ∑i xi wi (i = 1 to n)

Here ‘b’ is bias and ‘n’ is the total number of input neurons.


Step 6 − Apply the following activation function to obtain the final output.
f(Yin) = 1 if Yin > θ
         0 if −θ ≤ Yin ≤ θ
        −1 if Yin < −θ

Step 7 − Adjust the weight and bias as follows −


Case 1 − if y ≠ t then,

Wi(new)=Wi(old)+αtxi
b(new)=b(old)+αt
Case 2 − if y = t then,
Wi(new)=Wi(old)
b(new)=b(old)
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Step 8 − Test for the stopping condition, which would happen when there is no change in
weight.
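
The single-output training procedure above maps directly onto a short program. The following is a minimal illustrative sketch in Python/NumPy (not part of the original notes); it assumes bipolar targets, the three-valued step activation with threshold θ, and stops when an epoch produces no weight change, as in Step 8.

import numpy as np

def step(y_in, theta):
    # Three-valued activation: 1 above theta, -1 below -theta, 0 in between
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0

def train_perceptron(X, T, alpha=1.0, theta=0.2, max_epochs=100):
    w = np.zeros(X.shape[1])            # Step 1: weights initialized to 0
    b = 0.0                             # Step 1: bias initialized to 0
    for _ in range(max_epochs):         # Step 2: repeat until stopping condition
        changed = False
        for x, t in zip(X, T):          # Step 3: every training vector
            y_in = b + np.dot(x, w)     # Step 5: net input
            y = step(y_in, theta)       # Step 6: activation
            if y != t:                  # Step 7, Case 1: update on error
                w = w + alpha * t * x
                b = b + alpha * t
                changed = True
        if not changed:                 # Step 8: no weight change -> stop
            break
    return w, b

# Illustrative data: the AND function with bipolar inputs and targets
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([1, -1, -1, -1])
print(train_perceptron(X, T))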

• Training Algorithm for Multiple Output Units

The following diagram is the architecture of perceptron for multiple output classes.

Step 1 − Initialize the following to start the training −

• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.


Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
xi=si (i=1 to n)
Step 5 − Obtain the net input with the following relation −
Yinj = bj + ∑i xi wij

Here 'bj' is the bias on output unit j and 'n' is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output for each output
unit j = 1 to m −
f(Yinj) = 1 if Yinj > θ
          0 if −θ ≤ Yinj ≤ θ
         −1 if Yinj < −θ

Step 7 − Adjust the weight and bias for i = 1 to n and j = 1 to m as follows −


Case 1 − if yj ≠ tj then,
Wij(new)=Wij(old)+αtjxi
bj(new)=bj(old)+αtj
Case 2 − if yj = tj then,
Wij(new)=Wij(old)
bj(new)=bj(old)
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Step 8 − Test for the stopping condition, which will happen when there is no change in weight.
****
➢ Adaptive Linear Neuron :

Adaline which stands for Adaptive Linear Neuron, is a network having a single linear unit. It
was developed by Widrow and Hoff in 1960. Some important points about Adaline are as
follows −
• It uses a bipolar activation function.
• It uses the delta rule for training to minimize the mean squared error (MSE) between
the actual output and the desired/target output.
• The weights and the bias are adjustable.

Architecture:

The basic structure of Adaline is similar to perceptron having an extra feedback loop with the
help of which the actual output is compared with the desired/target output. After comparison
on the basis of training algorithm, the weights and bias will be updated.


Training Algorithm

Step 1 − Initialize the following to start the training −

• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the learning
rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every bipolar training pair s:t.
Step 4 − Activate each input unit as follows −
xi=si (i=1 to n)

Step 5 − Obtain the net input with the following relation −


Yin = b + ∑i xi wi (i = 1 to n)

Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output −
f(Yin) = 1 if Yin ≥ 0
        −1 if Yin < 0

Step 7 − Adjust the weight and bias as follows −


Case 1 − if y ≠ t then,
Wi(new)=Wi(old)+α(t−Yin)Xi


b(new)=b(old)+α(t−Yin)

Case 2 − if y = t then,
Wi(new)=Wi(old)
b(new)=b(old)

Here ‘y’ is the actual output and ‘t’ is the desired/target output.
(t−Yin) is the computed error.
Step 8 − Test for the stopping condition, which will happen when there is no change in weight
or the highest weight change occurred during training is smaller than the specified tolerance.
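
A minimal Adaline training loop along the same lines is sketched below (illustrative Python/NumPy, not part of the original notes). It applies the delta-rule update of Step 7 using the net input Yin rather than the thresholded output, and stops when the largest weight change in an epoch falls below a tolerance, as in Step 8.

import numpy as np

def train_adaline(X, T, alpha=0.1, tol=1e-3, max_epochs=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        max_change = 0.0
        for x, t in zip(X, T):
            y_in = b + np.dot(x, w)          # Step 5: net input
            err = t - y_in                   # (t - Yin), the computed error
            w_new = w + alpha * err * x      # Step 7: delta-rule update
            b_new = b + alpha * err
            max_change = max(max_change, np.max(np.abs(w_new - w)))
            w, b = w_new, b_new
        if max_change < tol:                 # Step 8: stopping condition
            break
    return w, b

# Bipolar AND data, as in the perceptron example above
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
T = np.array([1, -1, -1, -1])
w, b = train_adaline(X, T)
print(np.where(b + X @ w >= 0, 1, -1))       # Step 6 activation applied after training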
****
➢ Back Propagation Network :
A back propagation network (BPN) is a multilayer neural network consisting of an input
layer, at least one hidden layer and an output layer. As its name suggests, back propagation
takes place in this network. The error, which is calculated at the output layer by comparing the
target output with the actual output, is propagated back towards the input layer.

Architecture

As shown in the diagram, the architecture of a BPN has three interconnected layers with
weights on them. The hidden layer as well as the output layer also has bias units, whose input is
always 1. As is clear from the diagram, the working of the BPN has two phases. One
phase sends the signal from the input layer to the output layer, and the other phase back-
propagates the error from the output layer to the input layer.


Training Algorithm

For training, BPN will use binary sigmoid activation function. The training of BPN will have
the following three phases.
• Phase 1 − Feed Forward Phase
• Phase 2 − Back Propagation of error
• Phase 3 − Updating of weights
All these steps will be concluded in the algorithm as follows
Step 1 − Initialize the following to start the training −

• Weights
• Learning rate α
For easy calculation and simplicity, take some small random values.
Step 2 − Continue step 3-11 when the stopping condition is not true.
Step 3 − Continue step 4-10 for every training pair.

Phase 1

Step 4 − Each input unit receives input signal xi and sends it to the hidden unit for all i = 1 to
n


Step 5 − Calculate the net input at the hidden unit using the following relation −
Qinj = b0j + ∑i xi vij

Here b0j is the bias on hidden unit j, and vij is the weight to unit j of the hidden layer coming
from unit i of the input layer.
Now calculate the net output by applying the following activation function
Qj=f(Qinj)
Send these output signals of the hidden layer units to the output layer units.
Step 6 − Calculate the net input at the output layer unit using the following relation −
Yink = b0k + ∑j Qj wjk

Here b0k is the bias on output unit k, and wjk is the weight to unit k of the output layer coming
from unit j of the hidden layer.
Calculate the net output by applying the following activation function
Yk=f(Yink)

Phase 2

Step 7 − Compute the error correcting term, in correspondence with the target pattern received
at each output unit, as follows −
δk = (tk − Yk) f′(Yink)
On this basis, update the weight and bias as follows −
Δwjk = α δk Qj
Δb0k = α δk
Then, send δk back to the hidden layer.
Step 8 − Now each hidden unit sums its delta inputs from the output units −
δinj = ∑k δk wjk

Error term can be calculated as follows −


δj = δinj f′(Qinj)
On this basis, update the weight and bias as follows −
Δvij = α δj xi
Δb0j = α δj

Phase 3

Step 9 − Each output unit (yk, k = 1 to m) updates its weights and bias as follows −
wjk(new) = wjk(old) + Δwjk


b0k(new)=b0k(old)+Δb0k
Step 10 − Each hidden unit (zj, j = 1 to p) updates its weights and bias as follows −
vij(new) = vij(old) + Δvij
b0j(new) = b0j(old) + Δb0j
Step 11 − Check for the stopping condition, which may be either the number of epochs
reached or the target output matches the actual output.
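
The three phases can be condensed into a single training step. The following is an illustrative Python/NumPy sketch (an assumed implementation, not part of the original notes) for one input, one hidden and one output layer, using the binary sigmoid, for which f′(z) = f(z)(1 − f(z)). Here V, bv are the input-to-hidden weights and biases, and W, bw the hidden-to-output ones (corresponding to vij, b0j and wjk, b0k above).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpn_train_step(x, t, V, bv, W, bw, alpha=0.25):
    # Phase 1: feed forward (Steps 4-6)
    Q_in = bv + x @ V                  # net input at hidden units
    Q = sigmoid(Q_in)
    Y_in = bw + Q @ W                  # net input at output units
    Y = sigmoid(Y_in)

    # Phase 2: back propagation of error (Steps 7-8)
    delta_k = (t - Y) * Y * (1 - Y)    # error term at output units
    dW = alpha * np.outer(Q, delta_k)
    dbw = alpha * delta_k
    delta_j = (delta_k @ W.T) * Q * (1 - Q)   # error term at hidden units
    dV = alpha * np.outer(x, delta_j)
    dbv = alpha * delta_j

    # Phase 3: weight updates (Steps 9-10)
    return V + dV, bv + dbv, W + dW, bw + dbw

Repeating this step over all training pairs for several epochs implements Steps 2-11.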
****

➢ Associative Memory Networks :


These kinds of neural networks work on the basis of pattern association, which means they
can store different patterns and at the time of giving an output they can produce one of the
stored patterns by matching them with the given input pattern. These types of memories are
also called Content-Addressable Memory (CAM). Associative memory makes a
parallel search with the stored patterns as data files.
Following are the two types of associative memories we can observe −

• Auto Associative Memory


• Hetero Associative memory

1. Auto Associative Memory

This is a single layer neural network in which the input training vector and the output target
vectors are the same. The weights are determined so that the network stores a set of patterns.

Architecture

As shown in the following figure, the architecture of Auto Associative memory network
has ‘n’ number of input training vectors and similar ‘n’ number of output target vectors.

Training Algorithm

For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 i=1 to n, j=1 to n


Step 2 − Perform steps 3-4 for each input vector.


Step 3 − Activate each input unit as follows −
Xi=Si(i=1 to n)

Step 4 − Activate each output unit as follows −


Yj=Sj (j=1 to n)

Step 5 − Adjust the weights as follows −


Wij(new)=Wij(old)+XiYj

Testing Algorithm

Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n
Yinj = ∑i xi wij

Step 5 − Apply the following activation function to calculate the output


Yj = f(Yinj) = +1 if Yinj > 0
              −1 if Yinj ≤ 0
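
The training and testing procedures above can be written compactly; the following is an illustrative Python/NumPy sketch (not part of the original notes) that stores one bipolar pattern with the Hebb rule and then recalls it from a noisy probe.

import numpy as np

def train_autoassociative(patterns):
    # Hebb rule: W is the sum of outer products s^T s over the stored patterns
    W = np.zeros((patterns.shape[1], patterns.shape[1]))
    for s in patterns:
        W += np.outer(s, s)                  # Step 5: wij(new) = wij(old) + xi*yj, with y = x
    return W

def recall(W, x):
    y_in = x @ W                             # Step 4: net input to each output unit
    return np.where(y_in > 0, 1, -1)         # Step 5: activation (+1 / -1)

patterns = np.array([[1, -1, 1, -1]])        # one stored bipolar pattern
W = train_autoassociative(patterns)
print(recall(W, np.array([1, -1, 1, 1])))    # noisy probe recalls [1, -1, 1, -1]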

2. Hetero Associative memory

Similar to Auto Associative Memory network, this is also a single layer neural network.
However, in this network the input training vector and the output target vectors are not the
same. The weights are determined so that the network stores a set of patterns. Hetero
associative network is static in nature, hence, there would be no non-linear and delay
operations.

Architecture

As shown in the following figure, the architecture of Hetero Associative Memory network
has ‘n’ number of input training vectors and ‘m’ number of output target vectors.


Training Algorithm

For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 i=1 to n, j=1 to m
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
Xi=Si (i=1 to n)

Step 4 − Activate each output unit as follows −


Yj=Sj (j=1 to m)

Step 5 − Adjust the weights as follows −


Wij(new)=Wij(old)+XiYj

Testing Algorithm

Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to m;
Yinj = ∑i xi wij

Step 5 − Apply the following activation function to calculate the output


Yj= f(Yinj)= +1 if Yinj >0
0 if Yinj = 0
-1 if Yinj < 0
****


➢ Training Algorithm For Pattern Association:


1- Hebb Rule for Pattern Association: -
The Hebb rule is the simplest and most common method of determining the
weights for an associative memory neural net. We denote the training vector pairs
(input training vector : target output vector) as s:t, and the testing input vector as
x, which may or may not be the same as one of the training input vectors. In the
training algorithm of the Hebb rule the weights are initially set to 0 and then updated using
the following formula:
Wij(new) = Wij(old) + xi yj ; (i = 1, ..., n; j = 1, ..., m)
where xi = si
and yj = tj
Outer products:
The weights found by using the Hebb rule (with all weights initially 0) can also
be described in terms of outer products of the input vector-output vector pairs s:t. The
outer product of two vectors
s = (s1, ..., si, ..., sn) and t = (t1, ..., tj, ..., tm)
is W = s^T t (an n × m matrix).
To store a set of associations s(p) : t(p), p = 1, . . . , P, where
s(p) = (s1(p), …., Si(p), …., Sn(p)) ;
t(p) = (t1(p), ……., tj(p), ……., tm(p))
Wij = ∑p si(p) tj(p)

This is the sum of the outer product matrices required to store each association
separately. In general, we shall use the preceding formula or the more concise vector
matrix form,
W = ∑p s(p)^T t(p)

Several authors normalize the weights found by the Hebb rule by a factor of 1/n, where
n is the number of units in the system
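
The outer-product form of the Hebb rule is easy to compute directly. The following is a small illustrative Python/NumPy sketch (the associations are made up for the example):

import numpy as np

# Two illustrative bipolar associations s(p) -> t(p)
S = np.array([[1, -1, 1, -1],
              [1, 1, -1, -1]])
T = np.array([[1, -1],
              [-1, 1]])

W = S.T @ T                             # W = sum over p of s(p)^T t(p)
print(W)

# Recall: thresholding s @ W recovers the stored targets
print(np.where(S @ W > 0, 1, -1))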
2- Delta Rule for Pattern Association :
In its original form, the delta rule assumed that the activation function for the
output unit was the identity function. Thus, using y for the computed output for the
input vector x, we have
yJ = netJ = ∑i xi wiJ


The weights can be updated using the following equation:


∆Wij = α (tj – yj) Xi
A simple extension allows for the use of any differentiable activation function; we shall
call this the extended delta rule. The update for the weight from the I’th input unit to
the J’th output unit is:
∆WIJ = α (tJ – yJ) xI f ʹ(netJ)
****

➢ BAM and Hopfield Networks:


BAM:
Bidirectional associative memory (BAM) is a type of recurrent neural network. BAM
was introduced by Bart Kosko in 1988. There are two types of associative memory,
auto-associative and hetero-associative. BAM is hetero-associative, meaning given a
pattern it can return another pattern which is potentially of a different size. It is similar
to the Hopfield network in that they are both forms of associative memory. However,
Hopfield nets return patterns of the same size.

Topology

A BAM contains two layers of neurons, which we shall denote X and Y. Layers X and Y are
fully connected to each other. Once the weights have been established, input into layer X
presents the pattern in layer Y, and vice versa

Procedure

Learning
Imagine we wish to store two associations, A1:B1 and A2:B2.

• A1 = (1, 0, 1, 0, 1, 0), B1 = (1, 1, 0, 0)


• A2 = (1, 1, 1, 0, 0, 0), B2 = (1, 0, 1, 0)
These are then transformed into the bipolar forms:

• X1 = (1, -1, 1, -1, 1, -1), Y1 = (1, 1, -1, -1)


• X2 = (1, 1, 1, -1, -1, -1), Y2 = (1, -1, 1, -1)
From there, we calculate M = ∑i Xi^T Yi, where Xi^T denotes the transpose. So,

M =  2  0  0 -2
     0 -2  2  0
     2  0  0 -2
    -2  0  0  2
     0  2 -2  0
    -2  0  0  2
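
The same computation in code: an illustrative Python/NumPy sketch (not part of the original notes) that builds M from the bipolar pairs above and checks forward recall.

import numpy as np

X = np.array([[1, -1, 1, -1, 1, -1],        # X1
              [1, 1, 1, -1, -1, -1]])       # X2
Y = np.array([[1, 1, -1, -1],               # Y1
              [1, -1, 1, -1]])              # Y2

M = X.T @ Y                                  # M = X1^T Y1 + X2^T Y2
print(M)                                     # reproduces the 6x4 matrix above

# Forward recall A -> B: threshold X @ M (backward recall B -> A uses M^T)
print(np.where(X @ M > 0, 1, -1))            # recovers Y1 and Y2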
Hopfield Networks:
Hopfield neural network was invented by Dr. John J. Hopfield in 1982. It consists of a single
layer which contains one or more fully connected recurrent neurons. The Hopfield network is
commonly used for auto-association and optimization tasks.
1. Discrete Hopfield Network:
A discrete Hopfield network operates in a discrete fashion; in other words, the input
and output patterns are discrete vectors, which can be either binary (0, 1) or
bipolar (+1, −1) in nature. The network has symmetrical weights with no self-connections,
i.e., wij = wji and wii = 0.

Architecture

Following are some important points to keep in mind about discrete Hopfield network −
• This model consists of neurons with one inverting and one non-inverting output.
• The output of each neuron should be the input of other neurons but not the input of
self.
• Weight/connection strength is represented by wij.
• Connections can be excitatory as well as inhibitory. It would be excitatory, if the output
of the neuron is same as the input, otherwise inhibitory.
• Weights should be symmetrical, i.e. wij = wji


The output from Y1 going to Y2, Yi and Yn have the weights w12, w1i and w1n respectively.
Similarly, other arcs have the weights on them.

Training Algorithm

During training of discrete Hopfield network, weights will be updated. As we know that we
can have the binary input vectors as well as bipolar input vectors. Hence, in both the cases,
weight updates can be done with the following relation
Case 1 − Binary input patterns
For a set of binary patterns s(p), p = 1 to P
Here, s(p) = (s1(p), s2(p), ..., si(p), ..., sn(p))
The weight matrix is given by
wij = ∑p [2si(p) − 1][2sj(p) − 1]   for i ≠ j

Case 2 − Bipolar input patterns
For a set of bipolar patterns s(p), p = 1 to P
Here, s(p) = (s1(p), s2(p), ..., si(p), ..., sn(p))
The weight matrix is given by
wij = ∑p si(p) sj(p)   for i ≠ j

Testing Algorithm


Step 1 − Initialize the weights, which are obtained from training algorithm by using Hebbian
principle.
Step 2 − Perform steps 3-9 if the activations of the network have not converged.
Step 3 − For each input vector X, perform steps 4-8.
Step 4 − Make initial activation of the network equal to the external input vector X as follows

Yi=Xi for i=1 to n
Step 5 − For each unit Yi, perform steps 6-9
Step 6 − Calculate the net input of the network as follows −
Yini = xi + ∑j yj wji

Step 7 − Apply the activation as follows over the net input to calculate the output −
Yi = 1 if Yini >θi
Yi if Yini = θi
0 if Yini < θi
Here θi is the threshold.
Step 8 − Broadcast this output yi to all other units.
Step 9 − Test the network for convergence.
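
The testing algorithm can be sketched as follows (illustrative Python/NumPy, not part of the original notes); it stores one binary pattern with the Case 1 weight rule and recalls it from a noisy probe by asynchronous updates.

import numpy as np

def hopfield_weights_binary(patterns):
    # Case 1 storage rule: wij = sum_p [2si(p)-1][2sj(p)-1], with wii = 0
    B = 2 * patterns - 1                     # convert binary {0,1} to bipolar {-1,+1}
    W = B.T @ B
    np.fill_diagonal(W, 0)
    return W

def hopfield_recall(W, x, theta=0.0, max_sweeps=100):
    y = x.astype(float).copy()               # Step 4: initial activations = external input
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):              # Step 5: update one unit Yi at a time
            y_in = x[i] + y @ W[:, i]        # Step 6: yini = xi + sum_j yj wji
            new = 1.0 if y_in > theta else (y[i] if y_in == theta else 0.0)   # Step 7
            if new != y[i]:
                y[i], changed = new, True
        if not changed:                      # Step 9: activations no longer change
            break
    return y

patterns = np.array([[1, 1, 0, 0]])          # one stored binary pattern
W = hopfield_weights_binary(patterns)
print(hopfield_recall(W, np.array([1, 1, 1, 0])))   # noisy probe -> [1. 1. 0. 0.]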
2.Continuous Hopfield Network

In comparison with Discrete Hopfield network, continuous network has time as a continuous
variable. It is also used in auto association and optimization problems such as travelling
salesman problem.
Model − The model or architecture can be built up by adding electrical components such as
amplifiers, which map the input voltage to the output voltage over a sigmoid activation
function.

Energy Function Evaluation

Ef = −(1/2) ∑i ∑j (j≠i) wij yi yj − ∑i xi yi + (1/λ) ∑i gri ∫0..yi a⁻¹(y) dy

where λ is the gain parameter and gri the input conductance of unit i.

THE END


UNIT – II
Unsupervised Learning Network- Introduction, Fixed Weight Competitive Nets, Maxnet,
Hamming Network, Kohonen Self-Organizing Feature Maps, Learning Vector Quantization,
Counter Propagation Networks, Adaptive Resonance Theory Networks. Special Networks
Introduction to various networks.
➢ Introduction

Unsupervised learning is a type of machine learning that looks for previously undetected
patterns in a data set with no pre-existing labels and with a minimum of human supervision. In
contrast to supervised learning, which usually makes use of human-labeled data, unsupervised
learning, also known as self-organization, allows for modeling of probability densities over
inputs. It forms one of the three main categories of machine learning, along
with supervised and reinforcement learning. Semi-supervised learning, a related variant,
makes use of both supervised and unsupervised techniques.
Two of the main methods used in unsupervised learning are principal
component and cluster analysis. Cluster analysis is used in unsupervised learning to group, or
segment, datasets with shared attributes in order to extrapolate algorithmic
relationships. Cluster analysis is a branch of machine learning that groups the data that has not
been labelled, classified or categorized. Instead of responding to feedback, cluster analysis
identifies commonalities in the data and reacts based on the presence or absence of such
commonalities in each new piece of data. This approach helps detect anomalous data points
that do not fit into either group.

➢ Fixed weight competitive nets


These are additional structures included in multi-output networks in order
to force their output layers to make a decision as to which single neuron will fire.
This mechanism is called competition. When competition is complete, only one
output neuron has a nonzero output. Fixed (symmetric) weight competitive nets include
the Maxnet and the Hamming net.

1- Maxnet

• Maxnet is based on the winner-take-all policy.
• The n nodes of a Maxnet are completely interconnected.
• There is no need to train the network, since the weights are fixed.
• The Maxnet operates as a recurrent recall network in an auxiliary mode.
• Activation functions

f(net) = net if net > 0
         0 otherwise

where ε is usually a positive number less than 1.

Fig: Maxnet architecture − n fully interconnected nodes, each with a self-excitatory weight of 1
and mutual inhibitory weights of −ε.

Maxnet Algorithm

Step 1: Set activations and weights,


aj (0) is the starting input value to node Aj

wij = 1   for i = j
     −ε   for i ≠ j
Step 2: If more than one node has nonzero output, do step 3 to 5.
Step 3: Update the activation (output) at each node for
j = 1, 2, 3……., n
aj(t+1) = f [ aj(t) − ε ∑i≠j ai(t) ]

є < 1/m where m is the number of competitive neurons

Step 4: Save activations for use in the next iteration.


aj (t+1) → aj (t)

Step 5: Test for stopping condition. If more than one node has a nonzero output then


Go To step 3, Else Stop.


Example: A Maxnet has three neurons with mutual inhibitory weights of 0.25 (ε = 0.25). The net is initially
activated by the input signals [0.1 0.3 0.9]. The activation function of the neurons is
f(net) = net if net > 0, and 0 otherwise.
Find the final winning neuron.

Solution:

First iteration: The net values are:


a1 (1) = f [0.1 - 0.25(0.3+0.9)] = 0
a2 (1) = f [0.3 - 0.25(0.1+0.9)] = 0.05
a3 (1) = f [0.9 - 0.25(0.1+0.3)] = 0.8

Second iteration: a1 (2) = f [0 - 0.25(0.05+0.8)] = 0


a2 (2) = f [0.05 - 0.25(0 +0.8)] = 0
a3 (2) = f [0.8 -0.25(0+0.05)] =0.7875

Then the 3rd neuron is the winner.
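
The same iteration in code: an illustrative Python/NumPy sketch (not part of the original notes) that reproduces the worked example above.

import numpy as np

def maxnet(a, eps=0.25, max_iter=100):
    # Iterate aj(t+1) = f[aj(t) - eps * sum_{i != j} ai(t)] until at most one node is nonzero
    a = np.array(a, dtype=float)
    for _ in range(max_iter):
        net = a - eps * (a.sum() - a)        # subtract the other nodes' activations
        a = np.where(net > 0, net, 0.0)      # f(net) = net if net > 0, else 0
        if np.count_nonzero(a) <= 1:         # stopping condition
            break
    return a

print(maxnet([0.1, 0.3, 0.9], eps=0.25))     # -> [0. 0. 0.7875]; the third neuron wins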

2.Hamming Net:

Hamming net is a maximum likelihood classifier net. It is used to determine an


exemplar vector which is most similar to an input vector. The measure of similarity is
obtained from the formula:
x·y = a − D = 2a − n, since a + D = n,
where D is the Hamming distance (the number of components in which the vectors differ), a is
the number of components in which the vectors agree, and n is the number of components in each
vector.
When the weight vector of a class unit is set to one half of the exemplar vector,
and the bias to n/2, the net finds the closest exemplar by finding the unit with the
maximum net input. A Maxnet is used for this purpose.


Fig: Hamming net structure − n input units feed two class units (y1, y2) through weights
wij = ei(j)/2; the class-unit outputs then go to a Maxnet, which selects the winning class.

Wij = ei(j)/2
where ei(j) is the i'th component of the j'th exemplar vector.

Terminology
m : number of exemplar vectors
n : number of input nodes (input vector components)
e(j) : j'th exemplar vector

Algorithm:

Step 1: Initialize the weights


wij = ei(j)/2 = i'th component of the j'th exemplar  ( i = 1, 2, ..., n and j = 1, 2, ..., m )
Initialize the bias values, bj = n/2
For each input vector X do steps 2 to 4

Step 2: Compute the net input to each output unit Yj as:
Yinj = bj + ∑i ei(j)/2 · xi   ( i = 1, 2, ..., n; j = 1, 2, ..., m )

Step 3: Maxnet iterations are used to find the best match exemplar.

Example: Given the exemplar vectors e(1) = (-1 1 1 -1) and
e(2) = (1 -1 1 -1), use a Hamming net to find the exemplar vector closest to each of the bipolar input
patterns
(1 1 -1 -1), (1 -1 -1 -1), (-1 -1 -1 1) and (-1 -1 1 1).


Solution:

Step 1: Store the exemplars in the weights as:


wij = ei(j)/2 = i'th component of the j'th exemplar,

W =  -0.5  0.5
      0.5 -0.5
      0.5  0.5
     -0.5 -0.5

since e(1) = (-1 1 1 -1) and e(2) = (1 -1 1 -1).
bj = n/2 = 2


step 2: Apply 1st bipolar input (1 1 -1 -1) Yin1
= b1 + ∑ xi wi1
= 2 + (1 1 -1 -1) .* (-0.5 0.5 0.5 -0.5)
=2
Yin2 = b2 + ∑ xi wi2
= 2 + (1 1 -1 -1) .* (0.5 -0.5 0.5 -0.5)
=2
Hence, the first input pattern has the same net input (and the same Hamming distance, HD = 2)
with respect to both exemplar vectors.

Step 3: Apply the second input vector (1 -1 -1 -1)


Yin1 = 2 + (1 -1 -1 -1) .* (-0.5 0.5 0.5 -0.5) =1
Yin2 = 2 + (1 -1 -1 -1) .* (0.5 -0.5 0.5 -0.5) =3
Since y2 > y1, then the second input best matches with the second
exemplar e(2).

Step 4: Apply input pattern no. 3 (-1 -1 -1 1)


Yin1 = 2 + (-1 -1 -1 1) .* (-0.5 0.5 0.5 -0.5) = 1
Yin2 = 2 + (-1 -1 -1 1) .* (0.5 -0.5 0.5 -0.5) = 1
Hence we have Hamming similarity.

Step 5: Consider the last input vector (-1 -1 1 1)


Yin1 = 2 + (-1 -1 1 1) .* 0.5 (-1 1 1 -1) = 2
Yin2 = 2+ (-1 -1 1 1) .* 0.5 (1 -1 1 -1) = 2
Hence we have Hamming similarity
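
The net-input computation of the example can be reproduced with a short script (illustrative Python/NumPy, not part of the original notes):

import numpy as np

E = np.array([[-1, 1, 1, -1],                # exemplar e(1)
              [1, -1, 1, -1]])               # exemplar e(2)
n = E.shape[1]
W = E.T / 2.0                                # wij = ei(j)/2
b = n / 2.0                                  # bj = n/2

inputs = np.array([[1, 1, -1, -1],
                   [1, -1, -1, -1],
                   [-1, -1, -1, 1],
                   [-1, -1, 1, 1]])

Y_in = b + inputs @ W                        # Yinj = bj + sum_i xi * ei(j)/2
print(Y_in)                                  # rows: [2 2], [1 3], [1 1], [2 2], as in the example
# A Maxnet over each row then selects the larger entry, i.e. the closer exemplar.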

Kohonen self organizing feature maps:

There can be various topologies; however, the following two topologies are used the most −

Rectangular Grid Topology

This topology has 24 nodes in the distance-2 grid, 16 nodes in the distance-1 grid, and 8 nodes in
the distance-0 grid, which means the difference between each rectangular grid is 8 nodes. The
winning unit is indicated by #.

Hexagonal Grid Topology

This topology has 18 nodes in the distance-2 grid, 12 nodes in the distance-1 grid, and 6 nodes in
the distance-0 grid, which means the difference between each hexagonal grid is 6 nodes. The
winning unit is indicated by #.

Architecture

The architecture of KSOM is similar to that of the competitive network. With the help of
neighborhood schemes, discussed earlier, the training can take place over the extended region of
the network.


Algorithm for training

Step 1 − Initialize the weights, the learning rate α and the neighborhood topological scheme.
Step 2 − Continue step 3-9, when the stopping condition is not true.
Step 3 − Continue step 4-6 for every input vector x.
Step 4 − Calculate Square of Euclidean Distance for j = 1 to m
D(j) = ∑i=1..n (xi − wij)²
Step 5 − Obtain the winning unit J, where D(J) is minimum.
Step 6 − Calculate the new weight of the winning unit by the following relation −
wij(new) = wij(old) + α[xi − wij(old)]

Step 7 − Update the learning rate α by the following relation −


α(t+1) = 0.5 α(t)

Step 8 − Reduce the radius of topological scheme.


Step 9 − Check for the stopping condition for the network.
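
A compressed sketch of one training epoch follows (illustrative Python/NumPy, not part of the original notes); for brevity it uses a neighborhood of radius zero, i.e. only the winning unit is updated in Step 6.

import numpy as np

def ksom_epoch(X, W, alpha):
    # W has shape (m, n): one weight vector per output unit
    for x in X:
        D = np.sum((x - W) ** 2, axis=1)     # Step 4: squared Euclidean distance
        J = np.argmin(D)                      # Step 5: winning unit
        W[J] += alpha * (x - W[J])            # Step 6: move the winner toward x
    return W, 0.5 * alpha                     # Step 7: alpha(t+1) = 0.5 * alpha(t)

rng = np.random.default_rng(0)
X = rng.random((100, 2))                      # illustrative 2-D data
W = rng.random((4, 2))                        # 4 output units
alpha = 0.5
for _ in range(10):                           # Steps 2-9, without the radius schedule
    W, alpha = ksom_epoch(X, W, alpha)
print(W)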

Learning Vector Quantization LVQ


LVQ, unlike vector quantization (VQ) and Kohonen self-organizing maps (KSOM),
is basically a competitive network which uses supervised learning. We may define it as
a process of classifying patterns where each output unit represents a class. As it uses
supervised learning, the network is given a set of training patterns with known classifications
along with an initial distribution of the output classes. After completing the training process, LVQ
will classify an input vector by assigning it to the same class as that of its output unit.
Architecture:

The following figure shows the architecture of LVQ, which is quite similar to the architecture of
KSOM. As we can see, there are "n" input units and "m" output units. The
layers are fully interconnected, with weights on the connections.

Parameters Used:

Following are the parameters used in LVQ training process as well as in the flowchart
• x = training vector (x1,...,xi,...,xn)
• T = class for training vector x
• wj = weight vector for jth output unit
• Cj = class associated with the jth output unit
Training Algorithm:

Step 1 − Initialize reference vectors, which can be done as follows −


• Step 1(a) − From the given set of training vectors, take the first "m" (number of clusters)
training vectors and use them as weight vectors. The remaining vectors
can be used for training.
• Step 1(b) − Assign the initial weights and classifications randomly.
• Step 1(c) − Apply the K-means clustering method.
Step 2 − Initialize the learning rate α.
Step 3 − Continue with steps 4-9, if the condition for stopping this algorithm is not met.
Step 4 − Follow steps 5-6 for every training input vector x.
Step 5 − Calculate the square of the Euclidean distance for j = 1 to m
D(j) = ∑i=1..n (xi − wij)²
Step 6 − Obtain the winning unit J, where D(J) is minimum.
Step 7 − Calculate the new weight of the winning unit by the following relation −

if T = Cj then wj(new) = wj(old) + α[x − wj(old)]
if T ≠ Cj then wj(new) = wj(old) − α[x − wj(old)]
Step 8 − Reduce the learning rate α.
Step 9 − Test for the stopping condition. It may be as follows −

• Maximum number of epochs reached.


• Learning rate reduced to a negligible value.
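
One pass of Steps 4-8 can be sketched as follows (illustrative Python/NumPy, not part of the original notes); the data and the choice of reference vectors are made up for the example.

import numpy as np

def lvq_epoch(X, T, W, C, alpha):
    # W: reference vectors, one row per output unit; C: class of each output unit
    for x, t in zip(X, T):
        D = np.sum((x - W) ** 2, axis=1)      # Step 5: squared Euclidean distance
        J = np.argmin(D)                       # Step 6: winning unit
        if t == C[J]:
            W[J] += alpha * (x - W[J])         # Step 7: same class -> move toward x
        else:
            W[J] -= alpha * (x - W[J])         #          different class -> move away
    return W, 0.5 * alpha                      # Step 8: reduce the learning rate

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
T = np.array([0, 0, 1, 1])
W = X[[0, 2]].astype(float)                    # Step 1(a): one reference vector per class
C = T[[0, 2]].copy()
alpha = 0.1
for _ in range(20):                            # Steps 3-9
    W, alpha = lvq_epoch(X, T, W, C, alpha)
print(W, C)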

CPN (COUNTERPROPAGATION NETWORK):

CPNs (counter propagation networks) were proposed by Hecht-Nielsen in 1987. They are
multilayer networks based on a combination of input, output, and clustering layers. The
applications of counter propagation nets are data compression, function approximation and pattern
association. The counter propagation network is basically constructed from an instar-outstar
model. This model is a three-layer neural network that performs input-output data mapping,
producing an output vector y in response to an input vector x, on the basis of competitive learning.
The three layers in an instar-outstar model are the input layer, the hidden (competitive) layer and
the output layer.
There are two stages involved in the training process of a counter propagation net. The
input vectors are clustered in the first stage. In the second stage of training, the weights from the

cluster layer units to the output units are tuned to obtain the desired response. There are two types
of counter propagation net:
1. Full counter propagation network
2. Forward-only counter propagation network

1. Full counter propagation network:


Full CPN efficiently represents a large number of vector pairs x:y by adaptively constructing a
look-up table. The full CPN works best if the inverse function exists and is continuous. The vectors
x and y propagate through the network in a counterflow manner to yield the output vectors x* and y*.

Architecture of Full CPN:


The four major components of the instar-outstar model are the input layer, the instar, the
competitive layer and the outstar. For each node in the input layer there is an input value xi. All
the instar are grouped into a layer called the competitive layer. Each of the instar responds
maximally to a group of input vectors in a different region of space. An outstar model is found to
have all the nodes in the output layer and a single node in the competitive layer. The outstar looks
like the fan-out of a node.

Training Algorithm for Full CPN:



Step 0: Set the weights and the initial learning rate.


Step 1: Perform step 2 to 7 if stopping condition is false for phase I training.
Step 2: For each training input vector pair x:y presented, perform the steps below.
Step 3: Make the X-input layer activations to vector X.
Make the Y-input layer activation to vector Y.
Step 4: Find the winning cluster unit.
If the dot product method is used, find the cluster unit zj with the largest net input; for j = 1 to p,
zinj=∑xi.vij + ∑yk.wkj
If Euclidean distance method is used, find the cluster unit zj whose squared distance from input
vectors is the smallest:
Dj=∑(xi-vij)^2 + ∑(yk-wkj)^2
If there occurs a tie in case of selection of winner unit, the unit with the smallest index is the
winner. Take the winner unit index as J.
Step 5: Update the weights over the calculated winner unit zj.
For i=1 to n, viJ(new)=viJ(old) + α[xi-viJ(old)]
For k =1 to m, wkJ(new)=wkJ(old) + β[yk-wkJ(old)]
Step 6: Reduce the learning rates.
α (t+1)=0.5α(t); β(t+1)=0.5β(t)
Step 7: Test stopping condition for phase I training.
Step 8: Perform step 9 to 15 when stopping condition is false for phase II training.
Step 9: Perform step 10 to 13 for each training input vector pair x:y. Here α and β are small
constant values.
Step 10: Make the X-input layer activations to vector x. Make the Y-input layer activations to
vector y.
Step 11: Find the winning cluster unit (Using the formula from step 4). Take the winner unit index
as J.
Step 12: Update the weights entering into unit zJ.
For i=1 to n, viJ(new)=viJ(old) + α[xi-viJ(old)]
For k =1 to m, wkJ(new)=wkJ(old) + β[yk-wkJ(old)]
Step 13: Update the weights from unit zj to the output layers.
For i=1 to n, tJi(new)=tJi(old) + b[xi-tJi(old)]
For k =1 to m, uJk(new)=uJk(old) + a[yk-uJk(old)]

Step 14: Reduce the learning rates a and b.
a(t+1)=0.5a(t); b(t+1)=0.5b(t)
Step 15: Test stopping condition for phase II training.

2.Forward-only Counter propagation network:


A simplified version of full CPN is the forward-only CPN. Forward-only CPN uses only
the x vector to form the cluster on the Kohonen units during phase I training. In case of forward-
only CPN, first input vectors are presented to the input units. First, the weights between the input
layer and cluster layer are trained. Then the weights between the cluster layer and output layer are
trained. This is a specific competitive network, with target known.
Architecture of forward-only CPN:
It consists of three layers: input layer, cluster layer and output layer. Its architecture
resembles the back-propagation network, but in CPN there exists interconnections between the
units in the cluster layer.

Training Algorithm for Forward-only CPN:
Step 0: Initialize the weights and learning rates.
Step 1: Perform step 2 to 7 when stopping condition for phase I training is false.
Step 2: Perform step 3 to 5 for each of training input X.
Step 3: Set the X-input layer activation to vector X.
Step 4: Compute the winning cluster unit J. If dot product method is used, find the cluster unit zJ
with the largest net input:
zinj=∑xi.vij

If Euclidean distance is used, find the cluster unit zJ square of whose distance from the input
pattern is smallest:
Dj=∑(xi-vij)^2
If there exists a tie in the selection of winner unit, the unit with the smallest index is chosen as the
winner.
Step 5: Perform weight updation for unit zJ. For i=1 to n,
viJ(new)=viJ(old) + α[xi-viJ(old)]
Step 6: Reduce learning rate α:
α (t+1)=0.5α(t)
Step 7: Test the stopping condition for phase I training.
Step 8: Perform steps 9 to 15 when the stopping condition for phase II training is false.
Step 9: Perform step 10 to 13 for each training input pair x:y.
Step 10: Set X-input layer activations to vector X. Set Y-output layer activation to vector Y.
Step 11: Find the winning cluster unit J.
Step 12: Update the weights into unit zJ. For i=1 to n,
viJ(new)=viJ(old) + α[xi-viJ(old)]
Step 13: Update the weights from unit zJ to the output units.
For k=1 to m, wJk(new)=wJk(old) + β[yk-wJk(old)]
Step 14: Reduce learning rate β,
β(t+1)=0.5β(t)
Step 15: Test the stopping condition for phase II training.
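
The two training phases of the forward-only CPN can be sketched compactly (illustrative Python/NumPy, not part of the original notes; the winner is found with the Euclidean-distance method).

import numpy as np

def train_forward_only_cpn(X, Y, p, alpha=0.5, beta=0.5, epochs=10, seed=0):
    # Phase I trains the input-to-cluster weights V (Kohonen layer);
    # Phase II also trains the cluster-to-output weights W (Grossberg layer).
    rng = np.random.default_rng(seed)
    V = rng.random((X.shape[1], p))            # input-to-cluster weights v_iJ
    W = rng.random((p, Y.shape[1]))            # cluster-to-output weights w_Jk

    for _ in range(epochs):                    # Phase I (Steps 1-7)
        for x in X:
            J = np.argmin(np.sum((x[:, None] - V) ** 2, axis=0))   # winning cluster unit
            V[:, J] += alpha * (x - V[:, J])   # Step 5
        alpha *= 0.5                           # Step 6: reduce learning rate

    for _ in range(epochs):                    # Phase II (Steps 8-15)
        for x, y in zip(X, Y):
            J = np.argmin(np.sum((x[:, None] - V) ** 2, axis=0))
            V[:, J] += alpha * (x - V[:, J])   # Step 12
            W[J] += beta * (y - W[J])          # Step 13
        beta *= 0.5                            # Step 14
    return V, W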

Adaptive Resonance Theory (ART):

Adaptive resonance theory is a type of neural network technique developed by Stephen


Grossberg and Gail Carpenter in 1987. The basic ART uses an unsupervised learning technique. The
terms "adaptive" and "resonance" suggest that these networks are open to new learning (i.e.
adaptive) without discarding previous or old information (i.e. resonance). The ART
networks are known to solve the stability-plasticity dilemma, i.e., stability refers to their nature of
memorizing what has been learned, and plasticity refers to the fact that they are flexible enough to gain
new information.

Types of Adaptive Resonance Theory(ART)

Carpenter and Grossberg developed different ART architectures as a result of 20 years of research.
The ARTs can be classified as follows:
• ART1 – It is the simplest and the basic ART architecture. It is capable of clustering binary
input values.
• ART2 – It is an extension of ART1 that is capable of clustering continuous-valued input data.
• Fuzzy ART – It is the augmentation of fuzzy logic and ART.
• ARTMAP – It is a supervised form of ART learning where one ART learns based on the
previous ART module. It is also known as predictive ART.
• FARTMAP – This is a supervised ART architecture with Fuzzy logic included.

Basic of Adaptive Resonance Theory (ART) Architecture

The adaptive resonant theory is a type of neural network that is self-organizing and
competitive. It can be of both types, the unsupervised ones(ART1, ART2, ART3, etc) or the
supervised ones(ARTMAP). Generally, the supervised algorithms are named with the suffix
“MAP”.
But the basic ART model is unsupervised in nature and consists of :
• The F1 layer accepts the inputs and performs some processing and transfers it to the F2
layer that best matches with the classification factor.
There exist two sets of weighted interconnection for controlling the degree of similarity
between the units in the F1 and the F2 layer.
• The F2 layer is a competitive layer. The cluster unit with the largest net input becomes the
candidate to learn the input pattern first, and the rest of the F2 units are ignored.
• The reset unit decides whether or not the cluster unit is allowed to learn the
input pattern, depending on how similar its top-down weight vector is to the input vector.
This decision is called the vigilance test.
Thus we can say that the vigilance parameter helps to incorporate new memories or new
information. Higher vigilance produces more detailed memories, lower vigilance
produces more general memories.
Generally, two types of learning exist: slow learning and fast learning. In fast learning, the weight
update during resonance occurs rapidly; it is used in ART1. In slow learning, the weight change
occurs slowly relative to the duration of a learning trial; it is used in ART2.

➢ Advantage of Adaptive Resonance Theory (ART)

• It exhibits stability and is not disturbed by a wide variety of inputs provided to its network.
• It can be integrated and used with various other techniques to give more good results.

• It can be used for various fields such as mobile robot control, face recognition, land cover
classification, target recognition, medical diagnosis, signature verification, clustering web
users, etc.
• It has advantages over plain competitive learning, which lacks the capability to add new
clusters when deemed necessary.

➢ Limitations of Adaptive Resonance Theory:

Some ART networks are inconsistent (like Fuzzy ART and ART1), as their results depend upon
the order in which the training data are presented, or upon the learning rate.

****

UNIT – III
Introduction to Deep Learning, Historical Trends in Deep learning, Deep Feed -
forward networks, Gradient-Based learning, Hidden Units, Architecture Design, Back-
Propagation and Other Differentiation Algorithms

Introduction to Deep Learning


Deep learning is a sub-field of machine learning dealing with algorithms inspired by
the structure and function of the brain, called artificial neural networks. In other words, it mirrors
the functioning of our brains. Deep learning algorithms are structured similarly to the nervous
system, where each neuron is connected to the others and passes information along.

Example of different representations: suppose we want to separate two categories of data
by drawing a line between them in a scatterplot. If the data are represented in Cartesian coordinates
the task may be impossible for a linear separator, but when the same data are represented in polar
coordinates the two categories can be separated by a simple vertical line.

Deep learning allows the computer to build complex concepts out of simpler
concepts.
Below figure shows how a deep learning system can represent the concept of an image of
a person by combining simpler concepts, such as corners and contours, which are in turn defined
in terms of edges. The quintessential example of a deep learning model is the feedforward deep
network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function
mapping some set of input values to output values.


Figure 1.2: Illustration of a deep learning model.

Figure 1.3: Illustration of computational graphs mapping an input to an output where


each node performs an operation.

There are two main ways of measuring the depth of a model. The first view is based
on the number of sequential instructions that must be executed to evaluate the architecture. Above
figure illustrates how this choice of language can give two different measurements for the same
architecture. Another approach, used by deep probabilistic models, regards the depth of a model
as being not the depth of the computational graph but the depth of the graph describing how
concepts are related to each other.

Historical Trends in Deep learning


It is easiest to understand deep learning with some historical context. Rather than
providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting
different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has
increased.
• Deep learning models have grown in size over time as computer infrastructure (both
hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy
over time.

The Many Names and Changing Fortunes of Neural Networks

Broadly speaking, there have been three waves of development of deep learning:
deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism
in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006.

Some of the earliest learning algorithms we recognize today were intended to be


computational models of biological learning, i.e. models of how learning happens or could happen
in the brain. As a result, one of the names that deep learning has gone by is artificial neural
networks (ANNs).

Fig: This figure shows two of the three historical waves of artificial neural nets research,
as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural
networks” according to Google Books.

Increasing Dataset Sizes

One may wonder why deep learning has only recently become recognized as a
crucial technology though the first experiments with artificial neural networks were conducted in
the 1950s. As our computers are increasingly networked together, it becomes easier to centralize
these records and curate them into a dataset appropriate for machine learning applications. As of
2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve
acceptable performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10 million labeled

examples. Working successfully with datasets smaller than this is an important research area,
focusing in particular on how we can take advantage of large quantities of unlabeled examples,
with unsupervised or semi-supervised learning.

Increasing Model Sizes

Another key reason that neural networks are wildly successful today after enjoying
comparatively little success since the 1980s is that we have the computational resources to run
much larger models today. The increase in model size over time, due to the availability of faster
CPUs, the advent of general purpose GPUs, faster network connectivity and better software
infrastructure for distributed computing, is one of the most important trends in the history of deep
learning. This trend is generally expected to continue well into the future.

Deep Feed - forward networks

Deep feedforward networks, also often called feedforward neural networks, or


multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a
feedforward network is to approximate some function f ∗. For example, for a classifier, y = f ∗(x)
maps an input x to a category y. A feedforward network defines a mapping y = f (x; θ) and learns
the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations used to define f, and
finally to the output y. There are no feedback connections in which outputs of the model are fed
back into itself.

Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is associated
with a directed acyclic graph describing how the functions are composed together. For
example, we might have three functions f (1), f (2), and f (3) connected in a chain, to form f(x) =
f(3)(f (2)(f(1) (x ))). These chain structures are the most commonly used structures of neural
networks. In this case, f (1) is called the first layer of the network, f (2) is called the second layer,
and so on. The overall length of the chain gives the depth of the model. It is from this terminology
that the name “deep learning” arises. The final layer of a feedforward network is called the output
layer. The learning algorithm must decide how to use these layers to best implement an
approximation of f∗. Because the training data does not show the desired output for each of these
layers, these layers are called hidden layers.

Finally, these networks are called neural because they are loosely inspired by
neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of
these hidden layers determines the width of the model.

Feedforward networks have introduced the concept of a hidden layer, and this
requires us to choose the activation functions that will be used to compute the hidden layer values.
We must also design the architecture of the network, including how many layers the network

should contain, how these layers should be connected to each other, and how many units should
be in each layer. Learning in deep neural networks requires computing the gradients of complicated
functions. We present the back-propagation algorithm and its modern generalizations, which can
be used to efficiently compute these gradients.

Figure : An example of a feedforward network, drawn in two different styles. Specifically,


this is the feedforward network we use to solve the XOR example.

It has a single hidden layer containing two units. (Left)In this style, we draw every
unit as a node in the graph. This style is very explicit and unambiguous but for networks larger
than this example it can consume too much space. (Right)In this style, we draw a node in the graph
for each entire vector representing a layer’s activations. This style is much more compact.
Sometimes we annotate the edges in this graph with the name of the parameters that describe the
relationship between two layers. Here, we indicate that a matrix W describes the mapping from x
to h, and a vector w describes the mapping from h to y.

Gradient-Based Learning
Designing and training a neural network is not much different from training any
other machine learning model with gradient descent. Computing the gradient is slightly more
complicated for a neural network, but can still be done efficiently and exactly.
As with other machine learning models, to apply gradient-based learning we must
choose a cost function, and we must choose how to represent the output of the model.

Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost
function. Fortunately, the cost functions for neural networks are more or less the same as those for
other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x;θ ) and we simply
use the principle of maximum likelihood. This means we use the cross-entropy between the
training data and the model’s predictions as the cost function.
The total cost function used to train a neural network will often combine one of the
primary cost functions described here with a regularization term.
➢ Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means
that the cost function is simply the negative log-likelihood, equivalently described as the cross-
entropy between the training data and the model distribution. This cost function is given by
J(θ) = −E_{x,y∼p̂_data} [log p_model(y | x)]
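As a quick illustrative sketch (my own code, not part of these notes; the array names are made up), the cross-entropy cost for a softmax classifier is just the average negative log-likelihood of the correct classes over a mini-batch:

import numpy as np

def softmax(logits):
    # subtract the per-row max for numerical stability before exponentiating
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_cost(logits, labels):
    # negative log-likelihood (cross-entropy) averaged over the mini-batch
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# toy usage: 3 examples, 4 classes
logits = np.array([[2.0, 0.5, -1.0, 0.1],
                   [0.2, 1.5, 0.3, -0.7],
                   [-0.5, 0.0, 2.2, 0.4]])
labels = np.array([0, 1, 2])
print(nll_cost(logits, labels))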

Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most
of the time, we simply use the cross-entropy between the data distribution and the model
distribution. The choice of how to represent the output then determines the form of the cross-
entropy function.
Any kind of neural network unit that may be used as an output can also be used as
a hidden unit. we suppose that the feedforward network provides a set of hidden features defined
by h = f (x ;θ ). The role of the output layer is then to provide some additional transformation from
the features to complete the task that the network must perform.
➢ Linear Units for Gaussian Output Distributions

One simple kind of output unit is an output unit based on an affine transformation
with no nonlinearity. These are often just called linear units.
Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
Linear output layers are often used to produce the mean of a conditional
Gaussian distribution:
p(y|x) = N(y;yˆ,I).
Hidden Units

The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles. Rectified linear units are an excellent default choice
of hidden unit. The design process consists of trial and error, intuiting that a kind of hidden unit
may work well, and then training a network with that kind of hidden unit and evaluating its
performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all
input points. For example, the rectified linear function g(z) = max{0,z} is not differentiable at z =
0. This may seem like it invalidates g for use with a gradient based learning algorithm.

Unless indicated otherwise, most hidden units can be described as accepting a
vector of inputs x, computing an affine transformation z = W T x + b, and then applying an element-
wise nonlinear function g(z).
Most hidden units are distinguished from each other only by the choice of the form
of the activation function .

Rectified Linear Units and Their Generalizations


Rectified linear units use the activation function g(z) = max{0,z }.
Rectified linear units are easy to optimize because they are so similar to linear units. The
only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs
zero across half its domain.
Rectified linear units are typically used on top of an affine transformation:
h = g(Wᵀx + b)
One drawback to rectified linear units is that they cannot learn via gradient based methods
on examples for which their activation is zero.
Logistic Sigmoid and Hyperbolic Tangent
Prior to the introduction of rectified linear units, most neural networks used the
logistic sigmoid activation function
g(z) = σ(z)
or the hyperbolic tangent activation function
g(z) = tanh(z)
These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
Sigmoidal activation functions are more common in settings other than feedforward
networks. Recurrent networks, many probabilistic models, and some autoencoders have additional
requirements that rule out the use of piecewise linear activation functions and make sigmoidal
units more appealing despite the drawbacks of saturation.
Other Hidden Units
Many other types of hidden units are possible, but are used less frequently. In general, a
wide variety of differentiable functions perform perfectly well. Many unpublished activation
functions perform just as well as the popular ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using
the identity function as the activation function. We have already seen that a linear unit can be
useful as the output of a neural network. It may also be used as a hidden unit.
Softmax units are another kind of unit that is usually used as an output but may sometimes
be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete
variable with k possible values, so they may be used as a kind of switch.
A few other reasonably common hidden unit types include:
• Radial basis function or RBF unit: h_i = exp(−(1/σ_i²) ||W_:,i − x||²). This function becomes
more active as x approaches a template W_:,i. Because it saturates to 0 for most x, it can be
difficult to optimize.
• Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier, used for function
approximation and for the conditional distributions of undirected probabilistic models.
• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is
bounded: g(a) = max(−1, min(1, a)).
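The following is a minimal numpy sketch of these activation functions (my own illustration, not part of these notes); each one is applied element-wise to a pre-activation vector z = Wᵀx + b:

import numpy as np

def relu(z):      return np.maximum(0.0, z)        # g(z) = max{0, z}
def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))  # σ(z)
def softplus(z):  return np.log1p(np.exp(z))       # ζ(z) = log(1 + e^z), a smooth rectifier
def hard_tanh(z): return np.clip(z, -1.0, 1.0)     # max(−1, min(1, z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])          # a made-up pre-activation vector
for g in (relu, sigmoid, np.tanh, softplus, hard_tanh):
    print(g.__name__, g(z))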

Architecture Design
The word architecture refers to the overall structure of the network: how many units
it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers. Most neural network
architectures arrange these layers in a chain structure, with each layer being a function of the layer
that preceded it. In this structure, the first layer is given by
h(1)= g(1)(W(1)Tx + b(1))
the second layer is given by
h(2)= g(2)(W(2)T h(1) + b(2))
and so on.
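A short sketch of this chain structure (my own illustrative code with made-up layer sizes) composes the layers exactly as in the equations above:

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

def layer(W, b, g, h_prev):
    # one layer of the chain: h = g(W^T h_prev + b)
    return g(W.T @ h_prev + b)

# made-up sizes: 4 inputs -> 5 hidden units -> 3 hidden units
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

h1 = layer(W1, b1, relu, x)    # h(1) = g(1)(W(1)^T x + b(1))
h2 = layer(W2, b2, relu, h1)   # h(2) = g(2)(W(2)^T h(1) + b(2))
print(h2)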
In these chain-based architectures, the main architectural considerations are to
choose the depth of the network and the width of each layer. The ideal network architecture for a
task must be found via experimentation guided by monitoring the validation set error.
Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can by
definition represent only linear functions. It has the advantage of being easy to train because many
loss functions result in convex optimization problems when applied to linear models.

The universal approximation theorem states that a feedforward network with a
linear output layer and at least one hidden layer with any “squashing” activation function (such as
the logistic sigmoid activation function) can approximate any Borel measurable function from one
finite-dimensional space to another with any desired non-zero amount of error, provided that the
network is given enough hidden units.
The universal approximation theorem means that regardless of what function we
are trying to learn, we know that a large MLP will be able to represent this function.
In summary, a feedforward network with a single layer is sufficient to represent any
function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In
many circumstances, using deeper models can reduce the number of units required to represent the
desired function and can reduce the amount of generalization error.

Figure: An intuitive, geometric explanation of the exponential advantage of deeper rectifier


networks
More precisely, the main theorem in Montufar et al. states that the number of linear
regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is

O( (n choose d)^(d(l−1)) · n^d )

i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the
number of linear regions is
O( k^((l−1)+d) )

Other Architectural Considerations


Many neural network architectures have been developed for specific tasks.
Specialized architectures for computer vision are called convolutional networks. Feedforward
networks may also be generalized to recurrent neural networks for sequence processing.
Many architectures build a main chain but then add extra architectural features to
it, such as skip connections going from layer i to layer i+2 or higher. These skip connections make
it easier for the gradient to flow from output layers to layers nearer the input.
Another key consideration of architecture design is exactly how to connect a pair
of layers to each other. In the default neural network layer described by a linear transformation via
a matrix W, every input unit is connected to every output unit.

Figure: Empirical results showing that deeper networks generalize better when used
to transcribe multi-digit numbers from photographs of addresses.

Back-Propagation and Other Differentiation Algorithms


When we use a feedforward neural network to accept an input x and produce an
output ˆy, information flows forward through the network. The inputs x provide the initial
information that then propagates up to the hidden units at each layer and finally produces yˆ. This
is called forward propagation. During training, forward propagation can continue onward until

it produces a scalar cost J (θ). The back-propagation algorithm (Rumelhart et al., 1986a), often
simply called backprop, allows the information from the cost to then flow backwards through
the network, in order to compute the gradient.
The term back-propagation is often misunderstood as meaning the whole learning
algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method
for computing the gradient, while another algorithm, such as stochastic gradient descent, is used
to perform learning using this gradient.
Computational Graphs
To describe the back-propagation algorithm more precisely, it is helpful to have a
more precise computational graph language. Many ways of formalizing computation as graphs are
possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar,
vector, matrix, tensor, or even a variable of another type. To formalize our graphs, we also need
to introduce the idea of an operation. An operation is a simple function of one or more variables.

Figure: Examples of computational graphs

Chain Rule of Calculus
Back-propagation is an algorithm that computes the chain rule, with a specific order
of operations that is highly efficient. Let x be a real number, and let f and g both be functions
mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
the chain rule states that
dz/dx = (dz/dy)(dy/dx).
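As a quick numeric illustration (my own example, not from the notes), take y = g(x) = x² and z = f(y) = sin(y); the chain rule gives dz/dx = cos(x²) · 2x, which we can check against a finite-difference estimate:

import math

def g(x): return x * x          # y = g(x)
def f(y): return math.sin(y)    # z = f(y)

x = 1.3
# analytic gradient via the chain rule: dz/dx = (dz/dy)(dy/dx) = cos(x^2) * 2x
chain_rule = math.cos(g(x)) * 2 * x

# finite-difference check
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
print(chain_rule, numeric)      # the two values should agree closely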
Recursively Applying the Chain Rule to Obtain Backprop
Using the chain rule, it is straightforward to write down an algebraic expression for
the gradient of a scalar with respect to any node in the computational graph that produced that
scalar.
Specifically, many subexpressions may be repeated several times within the overall
expression for the gradient. Any procedure that computes the gradient will need to choose whether
to store these subexpressions or to recompute them several times. An example of how these
repeated subexpressions arise is given in figure .

Figure 6.9: A computational graph that results in repeated subexpressions when computing
the gradient.

The back-propagation algorithm is designed to reduce the number of common


subexpressions without regard to memory.

Symbol-to-Symbol Derivatives

Algebraic expressions and computational graphs both operate on symbols, or
variables that do not have specific values. These algebraic and graph-based representations are
called symbolic representations. When we actually use or train a neural network, we must assign
specific values to these symbols. We replace a symbolic input to the network x with a specific
numeric value, such as [1.2,3.765,−1.8]T.

Figure: An example of the symbol-to-symbol approach to computing derivatives. In


this approach, the back-propagation algorithm does not need to ever access any actual
specific numeric values. Instead, it adds nodes to a computational graph describing how
to compute these derivatives.

Some approaches to back-propagation take a computational graph and a set of


numerical values for the inputs to the graph, then return a set of numerical
values describing the gradient at those input values. We call this approach “symbol-to-
number” differentiation.

Another approach is to take a computational graph and add additional nodes to the
graph that provide a symbolic description of the desired derivatives.

General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some


scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient
with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each
parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that
produced z. We continue multiplying by Jacobians traveling backwards through the graph in this
way until we reach x. For any node that may be reached by going backwards from z through two
or more paths, we simply sum the gradients arriving from different paths at that node.

51

Downloaded by Raghu Polishetti (ragspol@gmail.com)


lOMoARcPSD|34321553

٠
١
More formally, each node in the graph G corresponds to a variable. To achieve
maximum generality, we describe this variable as being a tensor V. Tensors can in general have
any number of dimensions; they subsume scalars, vectors, and matrices.

We assume that each variable is associated with the following V subroutines:


• get_operation(V): This returns the operation that computes V, represented
by the edges coming into V in the computational graph. For example, there may be a Python
or C++ class representing the matrix multiplication operation. Suppose we have a variable that is
created by matrix multiplication, C = AB. Then get_operation(C) returns a pointer to an instance
of the corresponding class.
• get_consumers(V, G): This returns the list of variables that are children of
V in the computational graph G.
• get_inputs(V, G): This returns the list of variables that are parents of V
in the computational graph G.

The back-propagation algorithm itself does not need to know any differentiation
rules. It only needs to call each operation’s bprop rules with the right arguments. Formally,
op.bprop(inputs, X, G) must return

∑_i (∇_X op.f(inputs)_i) G_i

Here, inputs is a list of inputs that are supplied to the operation, op.f is the
mathematical function that the operation implements, X is the input whose gradient we
wish to compute, and G is the gradient on the output of the operation.
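As a rough illustration of these ideas (a minimal sketch of my own, not the API of any real deep learning library), each operation can carry its own bprop rule, and the gradient is accumulated by walking backwards through the graph:

import numpy as np

class MatMul:
    # z = A @ B; bprop returns the gradient with respect to the requested input X
    def f(self, A, B):
        return A @ B
    def bprop(self, inputs, X, G):
        A, B = inputs
        if X is A:
            return G @ B.T     # dL/dA = G B^T
        return A.T @ G         # dL/dB = A^T G

class ReLU:
    # z = max(0, x); the gradient passes through only where x > 0
    def f(self, x):
        return np.maximum(0.0, x)
    def bprop(self, inputs, X, G):
        (x,) = inputs
        return G * (x > 0)

# tiny hand-built graph: h = relu(W @ x), loss = sum(h)
W = np.array([[1.0, -2.0], [0.5, 3.0]])
x = np.array([[1.0], [2.0]])
matmul, relu = MatMul(), ReLU()
z = matmul.f(W, x)
h = relu.f(z)

G_h = np.ones_like(h)               # d(sum(h))/dh = 1
G_z = relu.bprop((z,), z, G_h)      # back through the ReLU
G_W = matmul.bprop((W, x), W, G_z)  # back through the matmul, w.r.t. W
print(G_W)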

Software implementations of back-propagation usually provide both the operations


and their bprop methods, so that users of deep learning software libraries are able to back-
propagate through graphs built using common operations like matrix multiplication, exponents,
logarithms, and so on. Software engineers who build a new implementation of back-propagation
or advanced users who need to add their own operation to an existing library must usually derive
the op.bprop method for any new operations manually.

Complications

Most software implementations need to support operations that can return more
than one tensor. For example, if we wish to compute both the maximum value in a tensor and the
index of that value, it is best to compute both in a single pass through memory, so it is most efficient
to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back


propagation. Back-propagation often involves summation of many tensors together. In the
naive approach, each of these tensors would be computed separately, then all of them would be
added in a second step. The naive approach has an overly high memory bottleneck that can be
avoided by maintaining a single buffer and adding each value to that buffer as it is computed.


Real-world implementations of back-propagation also need to handle various data


types, such as 32-bit floating point, 64-bit floating point, and integer values. Designing the policy
for handling each of these types requires special care.

Some operations have undefined gradients, and it is important to track these


cases and determine whether the gradient requested by the user is undefined.

Various other technicalities make real-world differentiation more


complicated. These technicalities are not insurmountable, and this chapter has described the key
intellectual tools needed to compute derivatives, but it is important to be aware that many more
subtleties exist.

Differentiation outside the Deep Learning Community

The deep learning community has been somewhat isolated from the broader
computer science community and has largely developed its own cultural attitudes concerning how
to perform differentiation. More generally, the field of automatic differentiation is concerned
with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic
differentiation. It is a special case of a broader class of techniques called reverse mode
accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders.
In general, determining the order of evaluation that results in the lowest computational cost is a
difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-
complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into
their least expensive form.


UNIT - IV : Regularization for Deep Learning Parameter norm Penalties, Norm Penalties as
Constrained Optimization, Regularization and Under-Constrained Problems, Dataset
Augmentation, Noise Robustness, Semi-Supervised learning, Multi-task learning, Early Stopping,
Parameter Typing and Parameter Sharing, Sparse Representations, Bagging and other Ensemble
Methods, Dropout, Adversarial Training, Tangent Distance, tangent Prop and Manifold, Tangent
Classifier

REGULARIZATION FOR DEEP LEARNING

In Machine Learning, and more so in Deep Learning, overfitting is a major issue that occurs
during training. A model is considered as overfitting the training data when the training error
keeps decreasing but the test error (or the generalisation error) starts increasing. At this point we
tend to believe that the model is learning the training data distribution and not generalising to
unseen data. Regularization is a modification we make to the learning algorithm or the model
architecture that reduces its generalisation error, possibly at the expense of increased training
error. There are various ways of doing this, some of which include restriction on parameter
values or adding terms to the objective function, etc.

These constraints are designed to encode some sort of prior knowledge, with a preference
towards simpler models to promote generalisation (see Occam’s Razor). The sections present in
this chapter are listed below:

1. Parameter Norm Penalties


2. Norm Penalties as Constrained Optimization

3. Regularization and Under-Constrained Problems
4. Dataset Augmentation
5. Noise Robustness
6. Semi-Supervised Learning
7. Mutlitask Learning
8. Early Stopping
9. Parameter Tying and Parameter Sharing
10. Sparse Representations
11. Bagging and Other Ensemble Methods
12. Dropout
13. Adversarial Training
14. Tangent Distance, Tangent Prop and Manifold Tangent Classifier

1. Parameter Norm Penalties

The idea here is to limit the capacity (the space of all possible model families) of the model by
adding a parameter norm penalty, Ω(θ), to the objective function, J:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

Here, θ represents only the weights and not the biases, the reason being that the biases require much less
data to fit and do not add much variance.

1.1 L² Parameter Regularization

Here, we have the following parameter norm penalty:

Ω(θ) = (1/2) ||w||₂²

Applying the 2nd order Taylor-Series approximation (ignoring all terms of order greater than 2 in
the Taylor-Series expansion) at the point w* (where the unregularized objective J assumes its minimum
value, i.e., ∇J(w*) = 0), we get the following expression (as the first order gradient term is 0):

Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀ H (w − w*)

Finally, ∇Ĵ(w) = H(w − w*), since the first term is just a constant and the derivative of xᵀHx
is 2Hx. The overall gradient of the regularized objective (gradient of Ĵ plus gradient of αΩ(θ)) becomes:

αw + H(w − w*)

Setting this gradient to zero gives the regularized optimum w̃ = (H + αI)⁻¹ H w*. As α approaches 0,
w̃ comes closer to w*. Since H is real and symmetric, it can be decomposed into a diagonal matrix Λ
and an orthonormal matrix of eigenvectors Q, that is, H = QΛQᵀ, which gives

w̃ = Q (Λ + αI)⁻¹ Λ Qᵀ w*

Because of the (Λ + αI)⁻¹Λ term, the value of each weight is rescaled along the eigenvectors of H.
The component of the weights along eigenvector i is rescaled by λᵢ /(λᵢ + α), where λᵢ is the
eigenvalue corresponding to that eigenvector.

The diagram below illustrates this well:


To look at its application to Machine Learning, we have to look at linear regression. The
objective function there is exactly quadratic, given by (Xw − y)ᵀ(Xw − y). Adding the L² penalty
changes the normal-equation solution from w = (XᵀX)⁻¹Xᵀy to w = (XᵀX + αI)⁻¹Xᵀy, i.e., α is
added to the diagonal of XᵀX before inversion.

1.2 L¹ Parameter Regularization

Here, the parameter norm penalty is given by: Ω(θ) = ||w||₁ = Σᵢ |wᵢ|

This makes the gradient of the overall objective function:

∇_w J̃(w; X, y) = ∇_w J(w; X, y) + α sign(w)


Now, the sign(w) term creates some difficulty, as the gradient no longer scales linearly with
w. This leads to a few complexities in arriving at the optimal solution: with a diagonal quadratic
approximation of J, each weight has the closed-form solution wᵢ = sign(wᵢ*) max{|wᵢ*| − α/Hᵢ,ᵢ, 0}.

The interpretation of the max term is that there shouldn't be a zero crossing, since the absolute
value function is not differentiable at zero; weights whose unregularized optimum is small enough
are pushed exactly to zero.

Thus, L¹ regularization has the property of sparsity, which is its fundamental distinguishing
feature from L². Hence, L¹ is used for feature selection as in LASSO.
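A tiny sketch of my own (not from the notes) contrasting the two penalties inside a gradient step; w_star stands for the unregularized optimum of a toy quadratic loss:

import numpy as np

alpha, lr = 0.1, 0.05
w_star = np.array([0.8, -0.05, 0.3])   # unregularized optimum of a toy quadratic loss
w_l2 = np.zeros(3)
w_l1 = np.zeros(3)

for _ in range(2000):
    w_l2 -= lr * ((w_l2 - w_star) + alpha * w_l2)           # L2: shrink proportionally to w
    w_l1 -= lr * ((w_l1 - w_star) + alpha * np.sign(w_l1))  # L1: constant-magnitude push toward 0

print("L2:", np.round(w_l2, 3))   # every component shrunk, none exactly zero
print("L1:", np.round(w_l1, 3))   # the small component is driven to (near) zero -> sparsity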

2. Norm penalties as constrained optimization

From chapter 4’s section 4, we know that to minimize any function under some constraints, we
can construct a generalized Lagrangian function containing the objective function along with the
penalties. Suppose we wanted Ω(θ) < k, then we could construct the following generalized Lagrangian:

L(θ, α; X, y) = J(θ; X, y) + α (Ω(θ) − k)

We get optimal θ by solving the Lagrangian. If Ω(θ) > k, then the weights need to be
penalized heavily and hence, α should be large to push its value below k. Likewise, if
Ω(θ) < k, then the norm shouldn't be reduced too much and hence, α should be small. This is now
similar to the parameter norm penalty regularized objective function as both of them encourage
lower values of the norm. Thus, parameter norm penalties naturally impose a constraint, like the
L²-regularization, defining a constrained L²-ball. Larger α implies a smaller constrained region as
it pushes the values really low, hence, allowing a small radius and vice versa. The idea of
constraints over penalties is important for several reasons. Large penalties might cause non-
convex optimization algorithms to get stuck in local minima due to small values of θ, leading to
the formation of so-called dead cells, as the weights entering and leaving them are too small to
have an impact. Constraints don’t enforce the weights to be near zero, rather being confined to a
constrained region.

Another reason is that constraints induce higher stability. With higher learning rates, there might
be a large weight, leading to a large gradient, which could go on iteratively leading to numerical
overflow in the value of θ. Constrains, along with reprojection (to the corresponding ball),
prevent the weights from becoming too large, thus, maintaining stability.

A final suggestion made by Hinton was to restrict the individual column norms of the weight
matrix rather than the Frobenius norm of the entire weight matrix, so as to prevent any hidden
unit from having a large weight. The idea here is that if we restrict the Frobenius norm, it doesn’t
guarantee that the individual weights would be small, just their norm. So, we might have large
weights being compensated by extremely small weights to make the overall norm small.
Restricting each hidden unit individually gives us the required guarantee.

3. Regularized & Under-constrained problems

Underdetermined problems are those problems that have infinitely many solutions. A logistic
regression problem having linearly separable classes with w as a solution, will always have 2w
as a solution and so on. In some machine learning problems, regularization is necessary. For e.g.,
many algorithms (e.g. PCA) require the inversion of X’ X, which might be singular. In such a
case, we can use a regularized form instead. (X’ X + αI) is guaranteed to be invertible.

Regularization can solve underdetermined problems. For e.g. the Moore-Penrose pseudoinverse
defined earlier as:

X⁺ = lim_{α→0} (XᵀX + αI)⁻¹ Xᵀ

This can be seen as performing a linear regression with L²-regularization.

4. Data augmentation
Having more data is the most desirable way to improve a machine learning model's
performance. In many cases, it is relatively easy to artificially generate data. For a classification
task, we desire for the model to be invariant to certain types of transformations, and we can
generate the corresponding (x,y)pairs by translating the input x. But for certain problems, like
density estimation, we can’t apply this directly unless we have already solved the density
estimation problem.

However, caution needs to be maintained while augmenting data to make sure that the class
doesn’t change. For e.g., if the labels contain both “b” and “d”, then horizontal flipping would be
a bad idea for data augmentation. Adding random noise to the inputs is another form of data
augmentation, while adding noise to hidden units can be seen as doing data augmentation at
multiple levels of abstraction.
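A minimal sketch of such input-level augmentation (my own illustration; the image array and noise scale are made up) that produces extra (x, y) pairs whose labels are unchanged:

import numpy as np

rng = np.random.default_rng(0)

def augment(image, label, noise_std=0.05):
    # return extra training pairs whose class label is unchanged
    flipped = image[:, ::-1]                                 # horizontal flip (a bad choice if
                                                             # classes distinguish "b" vs "d")
    noisy = image + rng.normal(0.0, noise_std, image.shape)  # small random input noise
    return [(flipped, label), (noisy, label)]

x = rng.random((4, 4))        # made-up 4x4 grayscale "image" with label 1
extra = augment(x, 1)
print(len(extra), "augmented examples")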

Finally, when comparing machine learning models, we need to evaluate them using the same
hand-designed data augmentation schemes or else it might happen that algorithm A outperforms
algorithm B, just because it was trained on a dataset which had more / better data augmentation.

5. Noise Robustness

Noise added to the inputs with infinitesimal variance is equivalent to imposing a penalty on the
norm of the weights. Noise added to hidden units is very important and is discussed later in
Dropout. Noise can even be added to the
weights. This has several interpretations. One of them is that adding noise to weights is a
stochastic implementation of Bayesian inference over the weights, where the weights are
considered to be uncertain, with the uncertainty being modelled by a probability distribution. It is
also interpreted as a more traditional form of regularization by ensuring stability in learning.

For e.g. in the linear regression case, we want to learn the mapping y(x) for each feature vector x,
by reducing the mean square error.

Now, suppose a zero mean unit variance Gaussian random noise, ϵ, is added to the weights. We
still want to learn the appropriate mapping by reducing the mean squared error. Minimizing the loss
after adding noise to the weights is equivalent to adding another regularization term which makes
sure that small perturbations in the weight values don’t affect the predictions much, thus
stabilising training.

Sometimes we may have the wrong output labels, in which case maximizing p(y | x)may not be a
good idea. In such a case, we can add noise to the labels by assigning a probability of (1-ϵ) that
the label is correct and a probability of ϵ that it is not. In the latter case, all the other labels are
equally likely. Label Smoothing regularizes a model with k softmax outputs by assigning the
classification targets with probability (1-ϵ ) or choosing any of the remaining (k-1) classes with
probability ϵ / (k-1).
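A small sketch of label smoothing (my own illustration) for k softmax classes: the one-hot target is replaced by (1 − ϵ) on the true class and ϵ/(k − 1) on every other class:

import numpy as np

def smooth_labels(labels, k, eps=0.1):
    # (1 - eps) on the true class, eps/(k-1) spread over the remaining classes
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=4))
# [[0.9    0.0333 0.0333 0.0333]
#  [0.0333 0.0333 0.9    0.0333]]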

6. Semi-Supervised Learning

P(x,y) denotes the joint distribution of x and y, i.e., corresponding to a training sample x, I have a
label y. P(x) denotes the marginal distribution of x, i.e., just the training examples without any
labels. In Semi-supervised Learning, we use both P(x,y)(some labelled samples) and
P(x)(unlabelled samples) to estimate P(y|x)(since we want to predict the class, given the training
sample). We want to learn some representation h = f(x)such that samples which are closer in the
input space have similar representations and a linear classifier in the new space achieves better
generalization error.

Instead of separating the supervised and unsupervised criteria, we can instead have a generative
model of P(x) (or P(x, y)) which shares parameters with the discriminative model. The idea is to
share the unsupervised/generative criterion with the supervised criterion to express a prior belief
that the structure of P(x) (or P(x, y)) is connected to the structure of P(y|x), which is expressed
by the shared parameters.

7. Multitask Learning

The idea is to improve the generalization error by pooling together examples from multiple tasks.

Similar to how more data leads to more generalization, using a part of the model for different
tasks constrains that part to learn good values. There are two types of model parameters:

1. Task-specific: These parameters benefit only from that particular task.


2. Generic, shared across all tasks: These are the ones which benefit from learning
through various tasks.

Multitask learning leads to better generalization when there is actually some relationship
between the tasks, which actually happens in the context of Deep Learning where some of the
factors, which explain the variation observed in the data, are shared across different tasks.

8. Early Stopping

As mentioned at the start of the post, after a certain point of time during training, for a model
with extremely high representational capacity, the training error continues to decrease but the
validation error begins to increase (which we referred to as overfitting). In such a scenario, a
better idea would be to return back to the point where the validation error was the least. Thus, we
need to keep calculating the validation metric after each epoch and if there is any improvement,
we store that parameter setting. Upon termination of training, we return the last saved
parameters.

The idea of Early Stopping is that if the validation error doesn’t improve over a certain fixed
number of iterations, we terminate the algorithm. This effectively reduces the capacity of the
model by reducing the number of steps required to fit the model. The evaluation on the validation

set can be done either in parallel on another GPU or after each training epoch. A drawback of weight
decay was that we had to manually tweak the weight decay coefficient, which, if chosen
wrongly, can lead the model to local minima by squashing the weight values too much. In Early
Stopping, no such parameter needs to be tweaked which reduces the number of hyperparameters
that we need to tune.
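A sketch of the early-stopping loop described above (my own illustration; ToyModel, the training step and the validation-error routine are hypothetical stand-ins for whatever model and evaluation code is in use):

import copy
import numpy as np

class ToyModel:                            # stand-in for a real model
    def __init__(self):
        self.parameters = np.array([5.0])

def early_stopping(model, train_one_epoch, validation_error, patience=5, max_epochs=100):
    # keep the parameter setting with the best validation error; stop once it has
    # not improved for `patience` consecutive epochs
    best_err = float("inf")
    best_params = copy.deepcopy(model.parameters)
    since_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_params, since_improvement = err, copy.deepcopy(model.parameters), 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break
    model.parameters = best_params
    return model, best_err

# toy usage: each "epoch" moves the parameter past its optimum at 2.0, so the
# validation error eventually rises and we return the best point seen
def train(m): m.parameters -= 0.4
val_err = lambda m: float((m.parameters[0] - 2.0) ** 2)
model, best = early_stopping(ToyModel(), train, val_err)
print(model.parameters, best)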

However, since we are setting aside some part of the training data for validation, we are not
using the complete training set. So, once Early Stopping is done, a second phase of training can
be done where the complete training set is used. There are two choices here:

• Train from scratch for the same number of steps as in the Early Stopping case.
• Use the weights learned from the first phase of training and retrain using the complete data.

Other than lowering the number of training steps, it reduces the computational cost also by
regularizing the model without having to add additional penalty terms. It affects the optimization
procedure by restricting it to a small volume of the parameter space, in the neighbourhood of the
initial parameters. Suppose 𝛕 and ϵ represent the number of iterations and the learning rate
respectively. Then, ϵ𝛕 effectively represents the capacity of the model. Intuitively, this can be
seen as the inverse of the weight decay co-efficient λ. When ϵ𝛕 is small (or λ is large), the
parameter space is small and vice versa. This equivalence holds true for a linear model with
quadratic cost function (initial parameters w⁰ = 0). Taking the Taylor Series Approximation of
J(w) around the empirically optimal weights w*, a single gradient descent step satisfies:

w(τ) − w* = (I − ϵH)(w(τ−1) − w*)

Multiplying with Qᵀ on both sides, using the eigendecomposition H = QΛQᵀ and the fact that QᵀQ = I
(Q is orthonormal), and unrolling from w⁰ = 0:

Qᵀw(τ) = [I − (I − ϵΛ)^τ] Qᵀw*

Assuming ϵ to be small enough that |1 − ϵλᵢ| < 1, this expression converges towards w*. The
corresponding equation for L² regularization is given by:

Qᵀw̃ = [I − (Λ + αI)⁻¹α] Qᵀw*


Thus, if the hyperparameters are such that:

(I − ϵΛ)^τ = (Λ + αI)⁻¹α

then L²-regularization can be seen as equivalent to Early Stopping. For small ϵλᵢ and small λᵢ/α this
reduces to α ≈ 1/(τϵ), matching the earlier intuition that ϵτ behaves like the inverse of the weight
decay coefficient.

9. Parameter Tying and Parameter Sharing

Till now, most of the methods focused on bringing the weights to a fixed point, e.g. 0 in the case
of norm penalty. However, there might be situations where we might have some prior knowledge
on the kind of dependencies that the model should encode. Suppose, two models A and B,
perform a classification task on similar input and output distributions. In such a case, we’d
expect the parameters for both the models to be similar to each other as well. We could impose a
norm penalty on the distance between the weights, but a more popular method is to force the set
of parameters to be equal. This is the essence behind Parameter Sharing. A major benefit here is
that we need to store only a subset of the parameters (e.g. storing only the parameters for model
A instead of storing for both A and B) which leads to large memory savings. In the example of
Convolutional Neural Networks or CNNs (discussed in Chapter 9), the same feature is computed
across different regions of the image and hence, a cat is detected irrespective of whether it is at
position i or i+1.

10. Sparse Representations


We can place penalties on even the activation values of the units which indirectly imposes a
penalty on the parameters. This leads to representational sparsity, where many of the activation
values of the units are zero. In the figure below, h is a representation of x, which is sparse.
Representational sparsity is obtained similarly to the way parameter sparsity is obtained, by
placing a penalty on the representation h instead of the weights.


Another idea could be to average the activation values across various examples and push it
towards some value. An example of getting representational sparsity by imposing hard constraint
on the activation value is the Orthogonal Matching Pursuit (OMP) algorithm, where a
representation h is learned for the input x by solving the constrained optimization problem:

where the constraint is on the the number


of non-zero entries indicated by b. The problem can be solved efficiently when W is restricted to
be orthogonal

11. Bagging and Other Ensemble Methods

The techniques which train multiple models and take the maximum vote across those models for
the final prediction are called ensemble methods. The idea is that it’s highly unlikely that
multiple models would make the same error in the test set.

Suppose that we have K regression models, with the model #i making an error ϵi on each
example, where ϵi is drawn from a zero mean, multivariate normal distribution such that:
𝔼(ϵi²)=v and 𝔼(ϵiϵj)=c. The error on each example is then the average across all the models: (∑
ϵi)/K.

The mean of this average error is 0 (as the mean of each of the individual ϵi is 0). The variance
of the average error is given by:

E[((∑ ϵi)/K)²] = v/K + (K − 1)c/K


Thus, if c = v, then there is no change. If c = 0, then the variance of the average error decreases
with K. There are various ensembling techniques. In the case of Bagging (Bootstrap
Aggregating), the same training algorithm is used multiple times. The dataset is broken into K
parts by sampling with replacement (see figure below for clarity) and a model is trained on each
of those K parts. Because of sampling with replacement, the K parts have a few similarities as
well as a few differences. These differences cause the difference in the predictions of the K
models. Model averaging is a very strong technique.
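A sketch of bagging on a toy regression problem (my own illustration): each of K models is fit on its own bootstrap sample, and the final prediction averages the K models:

import numpy as np

rng = np.random.default_rng(0)

# toy 1-D regression data
x = np.linspace(0, 1, 50)
y = 2 * x + 0.5 + rng.normal(0, 0.2, size=x.shape)

K = 10
models = []
for _ in range(K):
    idx = rng.integers(0, len(x), size=len(x))        # sample with replacement (bootstrap)
    models.append(np.polyfit(x[idx], y[idx], deg=1))  # fit one simple model per sample

x_test = np.array([0.25, 0.75])
preds = np.mean([np.polyval(m, x_test) for m in models], axis=0)  # average the K predictions
print(preds)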


12. Dropout

Dropout is a computationally inexpensive, yet powerful regularization technique. The problem


with bagging is that we can’t train an exponentially large number of models and store them for
prediction later. Dropout makes bagging practical by making an inexpensive approximation. In a
simplistic view, dropout trains the ensemble of all sub-networks formed by randomly removing a
few non-output units by multiplying their outputs by 0. For every training sample, a mask is
computed for all the input and hidden units independently. For clarification, suppose we have h
hidden units in some layer. Then, a mask for that layer refers to a h dimensional vector with
values either 0(remove the unit) or 1(keep the unit).

There are a few differences from bagging though:

• In bagging, the models are independent of each other, whereas in dropout, the different models
share parameters, with each model taking as input a sample of the total parameters.
• In bagging, each model is trained till convergence, but in dropout, each model is trained for
just one step and the parameter sharing makes sure that subsequent updates ensure better
predictions in the future.

At test time, we combine the predictions of all the models. In the case of bagging with K models,

this was given by the arithmetic mean. In case of dropout, the probability that a model is chosen
is given by p(μ), with μ denoting the mask vector. The prediction then becomes ∑ p(μ)p(y|x, μ).
This is not computationally feasible, and there’s a better method to compute this in one go, using
the geometric mean instead of the arithmetic mean.

We need to take care of two main things when working with geometric mean:

• None of the probabilities should be zero.
• Re-normalization to make sure all the probabilities sum to 1.

The advantage for dropout is that the first term can be approximated in one pass of the complete
model by dividing the weight values by the keep probability (weight scaling inference rule). The
motivation behind this is to capture the right expected values from the output of each unit, i.e. the
total expected input to a unit at train time is equal to the total expected input at test time. A big
advantage of dropout then is that it doesn’t place any restriction on the type of model or training
procedure to use.
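A minimal sketch of dropout for one hidden layer (my own illustration, using the common "inverted dropout" formulation of the weight scaling rule: activations are divided by the keep probability at train time so that nothing needs to be rescaled at test time):

import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, b, keep_prob=0.8, train=True):
    h = np.maximum(0.0, W.T @ x + b)               # affine transformation + ReLU
    if train:
        mask = rng.random(h.shape) < keep_prob     # one Bernoulli mask entry per unit
        h = h * mask / keep_prob                   # scale so the expected input to the
                                                   # next layer matches test time
    return h                                       # test time: no mask, no scaling needed

x = rng.normal(size=4)
W, b = rng.normal(size=(4, 6)), np.zeros(6)
print(hidden_layer(x, W, b, train=True))
print(hidden_layer(x, W, b, train=False))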

Points to note:

• Reduces the representational capacity of the model and hence, the model should be large
enough to begin with.
• Works better with more data.
• Equivalent to L² for linear regression, with different weight decay coefficient for each input
feature.

Biological Interpretation:

During sexual reproduction, genes are swapped between organisms, so a gene cannot rely on the
presence of any particular partner gene and must work well in many different genetic contexts.
Analogously, the units in dropout learn to perform well regardless of the presence of other hidden
units, and also in many different contexts.

Adding noise in the hidden layer is more effective than adding noise in the input layer. For e.g.
let’s assume that some unit learns to detect a nose in a face recognition task. Now, if this unit is
removed, then some other unit either learns to redundantly detect a nose or associates some other
feature (like mouth) for recognising a face. In either way, the model learns to make more use of
the information in the input. On the other hand, adding noise to the input won’t completely
remove the nose information, unless the noise is so large as to remove most of the information
from the input.

13. Adversarial Training

Deep Learning has outperformed humans in the task of Image Recognition, which might lead us
to believe that these models have acquired a human-level understanding of an image. However,
experimentally searching for an x′ (given an x), such that prediction made by the model changes,
shows otherwise. As shown in the image below, although the newly formed image (adversarial
image) looks almost exactly the same to a human, the model classifies it wrongly and that too
with very high confidence:

Adversarial training refers to training on images which are adversarially generated and it has
been shown to reduce the error rate. The main factor attributed to the above mentioned behaviour
is the linearity of the model (say y = Wx), caused by the main building blocks being primarily
linear. Thus, a small change of ϵ in the input causes a drastic change of Wϵ in the output. The
idea of adversarial training is to avoid this jumping and induce the model to be locally constant
in the neighborhood of the training data.
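The notes do not name a specific procedure for generating such examples; one widely used approach is the fast gradient sign method, sketched below (my own illustration, where grad_wrt_input is a hypothetical routine returning the gradient of the loss with respect to the input):

import numpy as np

def fgsm_adversarial(x, grad_wrt_input, eps=0.05):
    # take a small step in the direction that increases the loss the fastest
    return x + eps * np.sign(grad_wrt_input(x))

# toy usage with a made-up input gradient
x = np.array([0.2, -0.4, 0.9])
grad = lambda x: np.array([1.5, -0.3, 0.7])     # stand-in for a real back-propagated gradient
x_adv = fgsm_adversarial(x, grad)
print(x_adv)    # looks almost identical to x, yet may flip the model's prediction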

This can also be used in semi-supervised learning. For an unlabelled sample x, we can assign the
label ŷ(x) using our model. Then, we find an adversarial example, x′, such that y(x′) ≠ ŷ(x) (an
adversary found this way is called virtual adversarial example). The objective then is to assign
the same class to both x and x′. The idea behind this is that different classes are assumed to lie on
disconnected manifolds and a little push from one manifold shouldn’t land in any other manifold.

14. Tangent Distance, Tangent Prop and manifold Tangent Classifier

Many ML models assume the data to lie on a low dimensional manifold to overcome the curse of
dimensionality. The inherent assumption which follows is that small perturbations that cause the
data to move along the manifold (it originally belonged to), shouldn’t lead to different class
predictions. The idea of the tangent distance algorithm is to find the K-nearest neighbors using, as
the distance metric, the distance between manifolds. A manifold Mi is approximated by the tangent
plane at xi; hence, this technique needs tangent vectors to be specified.


The tangent prop algorithm trains a neural network based classifier, f(x), to be
invariant to known transformations causing the input to move along its manifold. Local
invariance requires that ∇f(x) is perpendicular to the tangent vectors V(i). This can be
achieved by adding a penalty term that minimizes the directional derivative of f(x) along each
of the V(i), e.g. Ω(f) = Σᵢ (∇ₓ f(x)ᵀ V(i))².

It is similar to data augmentation in that both of them use prior knowledge of the domain to
specify various transformations that the model should be invariant to. However, tangent prop
only resists infinitesimal perturbations while data augmentation causes invariance to much larger
perturbations.

Manifold Tangent Classifier works in two parts:

Use Autoencoders to learn the manifold structures using Unsupervised Learning.


Use these learned manifolds with tangent prop.

UNIT – V: Optimization for Train Deep Models Challenges in Neural Network Optimization,
Basic Algorithms, Parameter Initialization Strategies, Algorithms with Adaptive Learning Rates,
Approximate Second-Order Methods, Optimization Strategies and Meta-Algorithms Applications:
Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural Language Processing

1. CHALLENGES IN NEURAL NETWORK OPTIMIZATION

The optimization problem for training neural networks is generally non-convex. Some of
the challenges faced are mentioned below:

• Ill-conditioning of the Hessian Matrix: The Hessian matrix and condition number have
been covered in our summary for Chapter 4. For the sake of completion, the Hessian
matrix H of a function f with a vector-valued input x is given as H(f)(x)ᵢ,ⱼ = ∂²f(x) / (∂xᵢ ∂xⱼ).
• Local minima: Nearly any Deep Learning (DL ) model is guaranteed to have an extremely
large number of local minima (LM) arising due to the model identifiability problem.

• Plateaus, Saddle Points and Other Flat Regions: Saddle point (SP) is another type of
point with zero gradient where some points around it have higher value and the others have
lower. Intuitively, this means that a saddle point acts as both a local minima for some
neighbors and a local maxima for the others. Thus, the Hessian at an SP has both positive and
negative eigenvalues (in short, for a function to curve upwards or downwards around a point, as in
the case of local minima and local maxima, the eigenvalues should all have the same sign: positive
for local minima and negative for local maxima).
• Cliffs and Exploding Gradients: Neural Networks (NNs) might sometimes have
extremely steep regions resembling cliffs due to the repeated multiplication of weights.
Suppose we use a 3-layer (input-hidden-output) neural network with all the activation
functions as linear. We choose the same number of input, hidden and output neurons, thus,
using the same weight W for each layer. The output layer y = W*h where h =
W*x represents the hidden layer, finally giving y = W*W x. So, deep neural networks
involve multiplication of a large number of parameters leading to sharp non-linearities in

the parameter space. These non-linearities give rise to high gradients in some places. At the
edge of such a cliff, an update step might throw the parameters extremely far.

Image depicting the problem of exploding gradients when approaching a cliff. 1) Usual
training going on with the parameters moving towards the lower cost region. 2) The gradient at
the bottom left-most point pointed downwards (correct direction) but the step-size was too large,
which caused the parameters to land at a point having large cost value. 3) The gradient at this new
point moved the parameters in a completely different position undoing most of the training done
until that point.

• Long-Term Dependencies: This problem is encountered when the NN becomes


sufficiently deep. For example, if the same weight matrix W is used in each layer,
after t steps, we'd get W * W * W … (t times). Using the eigendecomposition of W:

Wᵗ = V diag(λ)ᵗ V⁻¹

Here, V is an orthonormal matrix, i.e. V Vᵀ = I, so V⁻¹ = Vᵀ.

Thus, any eigenvalues not near an absolute value of one would either explode or vanish
leading to the Vanishing and Exploding Gradient problem. The use of the same weight matrix is
especially the case in Recurrent NNs (RNNs), where this is a serious problem.

Values whose magnitude is not close to 1 either explode or vanish upon being repeatedly compounded
(compare, e.g., 1.01 raised to a large power with 0.99 raised to the same power).

• Inexact Gradients: Most optimization algorithms use a noisy/biased estimate of the


gradient in cases where the estimate is based on sampling, or in cases where the true gradient
is intractable for e.g. in the case of training a Restricted Boltzmann Machine (RBM), an
approximation of the gradient is used. For RBM, the contrastive divergence algorithm gives
a technique for approximating the gradient of its intractable log-likelihood.


2. BASIC ALGORITHMS
• Stochastic Gradient Descent: This has already been described before but there are certain
things that should be kept in mind regarding SGD. The learning rate ϵ is a very important
parameter for SGD. ϵ should be reduced after each epoch in general. This is due to the fact
that the random sampling of batches acts as a source of noise which might make SGD keep
oscillating around the minima without actually reaching it.

• Momentum: The momentum algorithm accumulates the exponentially decaying moving


average of past gradients (called as velocity) and uses it as the direction in which to take the
next step. Momentum is given by mass times velocity, which is equal to velocity if we’re
using unit mass. The momentum update is given by:

v ← α v − ϵ g     (g is the gradient estimate on the current mini-batch)
θ ← θ + v

The step size (earlier equal to learning rate * gradient) now depends on
how large and aligned the sequence of gradients are. If the gradients at each iteration point in the
same direction (say g), it will lead to a higher value of the step size as they just keep accumulating.
Once it reaches a constant (terminal) velocity, the step size becomes ϵ || g|| / (1 — α). Thus, using
α as 0.9 makes the speed 10 times. Common values of α are 0.5, 0.9 and 0.99.

Viewing it as the Newtonian dynamics of a particle sliding down a hill, the momentum
algorithm consists of solving a set of differential equations via numerical simulation. There are two
kinds of forces involved as shown below:

Momentum can be seen as two forces operating together. 1) Proportional to the negative
of the gradient such that whenever it descends a steep part of the surface, it gathers speed and
continues sliding in that direction until it goes uphill again. 2) A viscous drag force (friction)

proportional to -v(t) without the presence of which the particle would keep oscillating back and
forth as the negative of the gradient would keep forcing it to move downhill . Viscous force is
suitable as it is weak enough to allow the gradient to cause motion and strong enough to resist any
motion if the gradient doesn’t justify moving

Read more about momentum in this excellent blog post by distill.ai: Why Momentum
Really Works.

• Nesterov Momentum: This is a slight modification of the usual momentum equation. Here,
the gradient is calculated after applying the current velocity to the parameters, which can be
viewed as adding a correction factor:

v ← α v − ϵ ∇θ J(θ + α v)
θ ← θ + v
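A small sketch of my own comparing the two updates on a toy quadratic loss J(θ) = θ² (the loss and its gradient are made up for illustration):

import numpy as np

grad = lambda theta: 2.0 * theta                 # gradient of the toy loss J(theta) = theta^2

def momentum(theta, steps=100, lr=0.1, alpha=0.9, nesterov=False):
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta + alpha * v) if nesterov else grad(theta)  # Nesterov evaluates the
                                                                   # gradient after the look-ahead
        v = alpha * v - lr * g
        theta = theta + v
    return theta

theta0 = np.array([5.0])
print(momentum(theta0))                  # standard momentum
print(momentum(theta0, nesterov=True))   # Nesterov momentum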

3. PARAMETER INITIALIZATION STRATEGIES

Training algorithms for deep learning models are iterative in nature and require the
specification of an initial point. This is extremely crucial as it often decides whether or not the
algorithm converges and if it does, then does the algorithm converge to a point with high cost or
low cost.

We have limited understanding of neural network optimization but the one property that we
know with complete certainty is that the initialization should break symmetry. This means that if
two hidden units are connected to the same input units, then these should have different initialization
or else the gradient would update both the units in the same way and we don’t learn anything new
by using an additional unit. The idea of having each unit learn something different motivates
random initialization of weights which is also computationally cheaper.

Biases are often chosen heuristically (zero mostly) and only the weights are randomly
initialized, almost always from a Gaussian or uniform distribution. The scale of the distribution is
of utmost concern. Large weights might have better symmetry-breaking effect but might lead to
chaos (extreme sensitivity to small perturbations in the input) and exploding values during forward
& back propagation. As an example of how large weights might lead to chaos, consider that there’s
a slight noise adding ϵ to the input. Now, we if did just a simple linear transformation like W * x,
the ϵ noise would add a factor of W * ϵ to the output. In case the weights are high, this ends up
making a significant contribution to the output. SGD and its variants tend to halt in areas near the
initial values, thereby expressing a prior that the path to the final parameters from the initial values
is discoverable by steepest descent algorithms. A more mathematical explanation for the symmetry
breaking can be found in the Appendix.

Various suggestions have been made for appropriate initialization of the parameters. The
most commonly used ones include sampling the weights of each fully-connected layer
having m inputs and n outputs uniformly from the following distributions:

• U(-1 / √m, 1 / √m)

• U(−√(6 / (m+n)), √(6 / (m+n)))

U(a, b) represents the uniform distribution where the probability density of each value between a
and b, a and b inclusive, is 1/(b−a). The density of every other value is 0.

These initializations have already been incorporated into the most commonly used deep learning frameworks, so that you can simply specify which initializer to use and the framework takes care of sampling appropriately. For example, Keras, a very popular deep learning framework, has a module called initializers, where the second distribution (of the two mentioned above) is implemented as glorot_uniform.
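A minimal NumPy sketch of the glorot_uniform scheme (the helper name and the Keras usage shown in the comment are illustrative):

import numpy as np

def glorot_uniform(m, n, rng=None):
    # Sample an (m, n) weight matrix from U(-limit, limit),
    # with limit = sqrt(6 / (m + n)), i.e. the second scheme above.
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

# In Keras the same scheme can be requested by name, e.g.
#   keras.layers.Dense(128, kernel_initializer="glorot_uniform")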

One drawback of using 1/√m as the scale is that the weights end up being very small when a layer has many input/output units. Motivated by the idea of keeping the total amount of input to each unit independent of the number of input units m, sparse initialization sets each unit to have exactly k non-zero weights. However, it takes a long time for gradient descent to correct unsuitably large values, and hence this initialization might cause problems.

If the weights are too small, the range of activations across the mini-batch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable initial activations throughout.

The biases are relatively easier to choose. Setting the biases to zero is compatible with most weight initialization schemes, except in a few cases, e.g. for an output unit, to prevent saturation at initialization, or when a unit acts as a gate for making a decision. Refer to the chapter for details.

4. ALGORITHMS WITH ADAPTIVE LEARNING RATES

• AdaGrad: As mentioned earlier, it is important to incrementally decrease the learning rate for faster convergence. Instead of manually reducing the learning rate after each (or
several) epochs, a better approach is to adapt the learning rate as training progresses. This can be done by scaling the learning rate of each model parameter individually, inversely proportional to the square root of the sum of the historical squared values of its gradient. In the parameter update equation below, r is initialized to 0 and the multiplication in the update step happens element-wise. Since the gradient value is different for each parameter, the learning rate is scaled differently for each parameter too. Parameters with a large gradient get a large decrease in their learning rate: a learning rate that is too high can lead to oscillations, or can cause a parameter that is approaching a minimum to jump over it (as explained in the figure below), so the learning rate should be decreased for better convergence. Parameters with small gradients get only a small decrease in their learning rate: they may already be close to their respective minima and should not be pushed away from them, and even if they are not, reducing the learning rate too much would shrink the updates even further and slow down learning.

AdaGrad parameter update equation.
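A minimal NumPy sketch of the AdaGrad step just described (the hyperparameter values are illustrative assumptions):

import numpy as np

def adagrad_update(theta, r, grad, lr=0.01, delta=1e-7):
    # Accumulate squared gradients element-wise in r and scale each
    # parameter's effective learning rate by 1 / (delta + sqrt(r)).
    r += grad * grad
    theta -= (lr / (delta + np.sqrt(r))) * grad
    return theta, r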

This figure illustrates the need to reduce the learning rate if gradient is large in case of a
single parameter. 1) One step of gradient descent representing a large gradient value. 2) Result of
reducing the learning rate — moves towards the minima 3) Scenario if the learning rate was not
reduced — it would have jumped over the minima.

However, accumulating squared gradients from the very beginning of training can lead to an excessive and premature decrease in the learning rate. Consider a model with only 2 parameters (for simplicity), both of which start with a gradient of 1000. After some iterations, the gradient of one of the parameters has reduced to 100 while that of the other is still around 750. However, because of the accumulation at every update, the accumulated gradients would still have almost the same value. For example (using the raw gradient values rather than their squares to keep the arithmetic simple), let the accumulated gradient for Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100, so 1/3100 ≈ 0.0003, and that for Parameter 2 be 1000 + 900 + 850 + 800 + 750 = 4300, so 1/4300 ≈ 0.0002. This leads to a similar decrease in the learning rates for both parameters, even though the parameter with the lower gradient may have its learning rate reduced far too much, leading to slower learning.

Figure explaining the problem with AdaGrad: accumulated gradients can cause the learning rate to be reduced far too much in the later stages, leading to slower learning.

• RMSProp: RMSProp addresses the problem caused by accumulated gradients in AdaGrad.


It modifies the gradient accumulation step to an exponentially weighted moving average in
order to discard history from the extreme past. The RMSProp update is given by:
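A minimal NumPy sketch of this update (hyperparameter values are illustrative assumptions):

import numpy as np

def rmsprop_update(theta, r, grad, lr=0.001, rho=0.9, delta=1e-6):
    # r is an exponentially weighted moving average of squared gradients,
    # so history from the extreme past is gradually discarded.
    r = rho * r + (1.0 - rho) * grad * grad
    theta -= (lr / np.sqrt(delta + r)) * grad
    return theta, r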

ρ is the weight used for the exponential averaging. As more updates are made, the contribution of past gradient values is reduced, since ρ < 1 and ρ > ρ² > ρ³ …

This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an
instance of AdaGrad initialized within that bowl. Let me explain why this is so. Consider the figure
below. The region represented by 1 indicates usual RMSProp parameter updates as given by the
update equation, which is nothing but exponentially averaged AdaGrad updates. Once the
optimization process lands on A, it essentially lands at the top of a convex bowl. At this point,
intuitively, all the updates before A can be seen to be forgotten due to the exponential averaging
and it can be seen as if (exponentially averaged) AdaGrad updates start from point A onwards.

Intuition behind RMSProp. 1) Usual parameter updates 2) Once it reaches the convex bowl,
exponentially weighted averaging would cause the effect of earlier gradients to reduce and to
simplify, we can assume their contribution to be zero. This can be seen as if AdaGrad had been
used with the training initiated inside the convex bowl

• Adam: Adapted from “adaptive moments”, it focuses on combining RMSProp and
Momentum. Firstly, it views Momentum as an estimate of the first-order moment and
RMSProp as that of the second moment. The weight update for Adam is given by:

Secondly, since s and r are initialized as zeros, the authors observed a bias during the initial
steps of training thereby adding a correction term for both the moments to account for their
initialization near the origin. As an example of what the effect of this bias correction is, we’ll look
at the values of s and r for a single parameter (in which case everything is now represented as a
scalar). Let’s first understand what would happen if there was no bias correction. Since s (notice
that this is not in bold as we are looking at the value for a single parameter and the s here is a
scalar) is initialized as zero, after the first iteration the value of s would be (1 − ρ1) * g and that of r would be (1 − ρ2) * g². The suggested default values for ρ1 and ρ2 are 0.9 and 0.999 respectively, so the initial values of s and r are very small, and this effect compounds as training progresses. However, if we now use bias correction, after the first iteration the value of s is just g and that of r is just g². This gets rid of the bias that occurs in the initial phase of training. A major advantage of Adam is that it is fairly robust to the choice of these hyperparameters, i.e. ρ1 and ρ2.
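A minimal NumPy sketch of the Adam step with bias correction (t is the 1-based iteration count; hyperparameter values are illustrative assumptions):

import numpy as np

def adam_update(theta, s, r, grad, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    s = rho1 * s + (1.0 - rho1) * grad            # momentum-like first moment
    r = rho2 * r + (1.0 - rho2) * grad * grad     # RMSProp-like second moment
    s_hat = s / (1.0 - rho1 ** t)                 # bias correction for the first moment
    r_hat = r / (1.0 - rho2 ** t)                 # bias correction for the second moment
    theta -= lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r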

The figure below shows the comparison between the various optimization methods
discussed above. It can be clearly seen that algorithms with adaptive learning rates provide faster
convergence:

NAG here refers to Nesterov Accelerated Gradient which is the same as Nesterov
Momentum.

5. APPROXIMATE SECOND-ORDER METHODS

The optimization algorithms that we have looked at so far involve computing only the first derivative. But there are many methods which involve higher-order derivatives as well. The main problem with these algorithms is that they are not practically feasible in their vanilla form, and so certain methods are used to approximate the required quantities. We explain three such methods, all of which use the empirical risk as the objective function:

• Newton’s Method: This is the most common higher-order derivative method used. It makes
use of the curvature of the loss function via its second-order derivative to arrive at the
optimal point. Using the second-order Taylor Series expansion to approximate J(θ) around
a point θo and ignoring derivatives of order greater than 2 (this has already been discussed
in previous chapters), we get:

We know that we get a critical point for any function f(x) by solving for f'(x) = 0. We get
the following critical point of the above equation (refer to the Appendix for proof):

For quadratic surfaces (i.e. where the cost function is quadratic), this directly gives the optimal result in one step, whereas gradient descent would still need to iterate. However, for surfaces that are not quadratic, as long as the Hessian remains positive definite, we can approach the optimal point through a 2-step iterative process: 1) compute the inverse of the Hessian and 2) update the parameters.
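A minimal NumPy sketch of a single (optionally regularized) Newton step, solving the linear system instead of forming the inverse explicitly (names are illustrative assumptions):

import numpy as np

def newton_step(theta, grad, hessian, alpha=0.0):
    # alpha = 0 gives the plain Newton update; a positive alpha adds the
    # regularization discussed below for dealing with negative curvature.
    H_reg = hessian + alpha * np.eye(len(theta))
    return theta - np.linalg.solve(H_reg, grad)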

Saddle points are problematic for Newton’s method. If the eigenvalues of the Hessian are not all positive, Newton’s method might cause the updates to move in the wrong direction. A way to avoid this is to add regularization by adding a scaled identity matrix αI to the Hessian:

However, if there is strong negative curvature, i.e. there are large negative eigenvalues, α needs to be sufficiently large to offset them, in which case the Hessian becomes dominated by the αI diagonal term. The update then reduces to the standard gradient divided by α:

Another problem restricting the use of Newton’s method is the computational cost. It
takes O(k³) time to calculate the inverse of the Hessian where k is the number of parameters. It’s
not uncommon for Deep Neural Networks to have about a million parameters and since the
parameters are updated every iteration, this inverse needs to be calculated at every iteration, which
is not computationally feasible.

• Conjugate Gradients: One weakness of the method of steepest descent (i.e. GD) is that line
searches happen along the direction of the gradient. Suppose the previous search direction
is d(t-1). Once the search terminates (which it does when the gradient along the current
gradient direction vanishes) at the minimum, the next search direction, d(t) is given by the
gradient at that point, which is orthogonal to d(t-1) (because if it’s not orthogonal, it’ll have
some component along d(t-1) which cannot be true as at the minimum, the gradient along
d(t-1) has vanished).
Upon getting the minimum along the current search direction, the minimum along the
previous search direction is not preserved, undoing, in a sense, the progress made in
previous search direction.

In the method of conjugate gradients, we seek a search direction that is conjugate to the
previous line search direction:

Now, the previous search direction contributes towards finding the next search direction, with d(t) and d(t-1) being conjugate if d(t)' H d(t-1) = 0. βt decides how much of d(t-1) is added back to the current search direction. There are two popular choices for βt: Fletcher-Reeves and
Polak-Ribière. These discussions assumed the cost function to be quadratic, in which case the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. To extend the concept to training neural networks, there is one additional change. Since the objective is no longer quadratic, there is no guarantee anymore that the conjugate direction would preserve the
minimum in the previous search directions. Thus, the algorithm includes occasional resets where
the method of conjugate gradients is restarted with line search along the unaltered gradient.
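A minimal sketch of nonlinear conjugate gradients with the Fletcher-Reeves choice of βt (grad_fn and line_search are assumed helpers; line_search(theta, d) returns the minimizer of the cost along direction d):

import numpy as np

def nonlinear_conjugate_gradients(theta, grad_fn, line_search, n_steps, restart_every=20):
    g = grad_fn(theta)
    d = -g
    for t in range(1, n_steps + 1):
        theta = line_search(theta, d)          # minimize along the current direction
        g_new = grad_fn(theta)
        if t % restart_every == 0:
            d = -g_new                         # occasional reset to the unaltered gradient
        else:
            beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
            d = -g_new + beta * d              # previous direction contributes to the new one
        g = g_new
    return theta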

• BFGS: This algorithm tries to bring the advantages of Newton’s method without the
additional computational burden by approximating the inverse of H by M(t), which is
iteratively refined using low-rank updates. Finally, line search is conducted along the
direction M(t)g(t). However, BFGS requires storing the matrix M(t) which takes O(n²)
memory making it infeasible. An approach called Limited Memory BFGS (L-BFGS) has
been proposed to tackle this infeasibility by computing the matrix M(t) using the same
method as BFGS but assuming that M(t−1) is the identity matrix.
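In practice, a library implementation is typically used. A small illustrative example with SciPy's L-BFGS-B optimizer on a toy quadratic objective (the objective and its gradient are made up for the demonstration):

import numpy as np
from scipy.optimize import minimize

def loss(theta):
    return 0.5 * np.sum(theta ** 2)

def grad(theta):
    return theta

result = minimize(loss, x0=np.ones(10), jac=grad, method="L-BFGS-B")
print(result.x)   # parameters found by limited-memory BFGS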

6. OPTIMIZATION STRATEGIES AND META-ALGORITHMS

• Batch Normalization: Batch normalization (BN) is one of the most exciting innovations in
Deep learning that has significantly stabilized the learning process and allowed faster
convergence rates. The intuition behind batch normalization is as follows: Most of the Deep
Learning networks are compositions of many layers (or functions) and the gradient with
respect to one layer is taken considering the other layers to be constant. However, in practice all the layers are updated simultaneously and this can lead to unexpected results. For example, let y* = x W¹ W² … W¹⁰. Here, y* is a linear function of x but not a linear function of the weights. Suppose the gradient is given by g and we now intend to reduce y* by 0.1. Using a first-order Taylor series approximation, taking a gradient descent step of −ϵg would reduce y* by ϵg’ g. Thus, ϵ should be 0.1/(g’ g) using just the first-order information. However, higher-order effects also creep in, as the updated y* is given by:

An example of a second-order term would be ϵ² g1 g2 ∏ wi, where the product runs over the weights of the intermediate layers. ∏ wi can be negligibly small or exponentially large depending on whether the individual weights are less than or greater than 1. Since the updates to one layer depend so strongly on the other layers, choosing an appropriate learning rate is tough. Batch normalization takes care of this problem by using an efficient reparameterization of almost any deep network. Given a matrix of activations, H, the normalization is given by: H’ = (H − μ) / σ, where the subtraction and division are broadcast. A small constant δ is added when computing σ to ensure that σ is never equal to 0.

Going back to the earlier example of y*, let the activations of layer l be given by h(l-1).
Then h(l-1) = x W1 W2 … W (l-1). Now, if x is drawn from a unit Gaussian, then h(l-1) also comes
from a Gaussian, however, not of zero mean and unit variance, as it is a linear transformation of x.
BN makes it zero mean and unit variance. Therefore, y* = Wl h(l-1), and learning becomes much simpler because the parameters of the lower layers mostly have no effect. In this linear example the simplicity is achieved by rendering the lower layers useless; in a realistic deep network with non-linearities, however, the lower layers remain useful. Finally, the complete reparameterization of BN is given by replacing H with γH’ + β. This is done to retain the expressive power of the network, and the mean of the reparameterized activations is now determined solely by β rather than by a complicated interaction between the layers below. Also, among the choice of normalizing X or XW + B, the authors recommend the latter, specifically XW, since B becomes redundant because of β. Practically, this means that when we are using a Batch Normalization layer, the bias of the preceding layer should be turned off. In a deep learning framework like Keras, this can be done by setting the parameter use_bias=False in the Convolutional layer, as in the sketch below.
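A minimal Keras sketch of a convolution + batch-normalization block with the bias turned off (layer sizes are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# The convolution's bias is redundant here because the BatchNormalization
# layer's beta parameter plays the same role.
block = keras.Sequential([
    layers.Conv2D(64, kernel_size=3, padding="same", use_bias=False),
    layers.BatchNormalization(),
    layers.Activation("relu"),
])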

• Coordinate Descent: Generally, a single weight update is made by taking the gradient with
respect to every parameter. However, in cases where some of the parameters might be
independent (discussed below) of the remaining, it might be more efficient to take the
gradient with respect to those independent sets of parameters separately for making updates.
Let me clarify that with an example. Suppose we have the following cost function:
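Based on the description that follows (H a sparse code, W a linear decoder of H back to X), the cost function has the standard sparse-coding form, written here for reference:

J(H, W) = Σ |H| + Σ (X − W H)², with both sums running element-wise over all entries.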

This cost function describes the learning problem called sparse coding. Here, H refers to the sparse representation of X and W is the set of weights used to linearly decode H to retrieve X. An explanation of why this cost function enforces the learning of a sparse representation of X follows. The first term of the cost function penalizes values far from 0 (positive or negative, because of the modulus operator |H|), which pushes most of the values towards 0 and hence makes H sparse. The second term penalizes the difference between X and the linear decoding of H by W, thereby enforcing the decoding to reproduce X. In this way, H is learned as a sparse “representation” of X. The cost function generally also includes a regularization term such as weight decay, which has been omitted for simplicity. Here, we can divide the entire list of parameters into two sets, W and H. Minimizing the cost function with respect to either of these sets of parameters is a convex problem. Coordinate Descent (CD) refers to minimizing the cost function with respect to only one parameter at a time. It has been shown that by repeatedly cycling through all the parameters, we are guaranteed to arrive at a (local) minimum. If instead of one parameter we take a set of parameters, as we did here with W and H, it is called block coordinate descent (the interested reader should explore Alternating Minimization); a minimal sketch is given below. CD makes sense if either the parameters are clearly separable into independent groups or if optimizing with respect to a certain set of parameters is more efficient than with respect to others.
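A minimal block-coordinate-descent sketch for the sparse-coding objective above (solve_H and solve_W are hypothetical helpers that each solve one of the two convex sub-problems):

def block_coordinate_descent(X, H, W, solve_H, solve_W, n_cycles=100):
    for _ in range(n_cycles):
        H = solve_H(X, W)   # minimize J(H, W) over H with W fixed (convex)
        W = solve_W(X, H)   # minimize J(H, W) over W with H fixed (convex)
    return H, W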

The points A, B, C and D indicate the locations in the parameter space where coordinate descent landed after each step.

Coordinate descent may fail terribly when one variable influences the optimal value of
another variable.

• Polyak Averaging: Polyak averaging consists of averaging several points in the parameter
space that the optimization algorithm traverses through. So, if the algorithm encounters the
points θ(1), θ(2), … during optimization, the output of Polyak averaging is:

The figure below explains the intuition behind Polyak averaging:

The optimization algorithm might oscillate back and forth across a valley without ever
reaching the minima. However, the average of those points should be closer to the bottom of the
valley.

Most optimization problems in deep learning are non-convex, and the path taken by the optimization algorithm can be quite complicated, so a point visited in the distant past may be quite far from the current point in parameter space. Including such distant points in the average might not be useful, which is why an exponentially decaying running average is taken instead. This scheme, in which recent iterates are weighted more than past ones, is called Polyak-Ruppert averaging:
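A minimal sketch of both schemes (names and the value of alpha are illustrative assumptions):

def polyak_average(thetas):
    # Plain Polyak averaging: the mean of all iterates visited so far.
    return sum(thetas) / len(thetas)

def polyak_ruppert_update(theta_avg, theta_t, alpha=0.999):
    # Exponentially decaying running average: recent iterates count more.
    return alpha * theta_avg + (1.0 - alpha) * theta_t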

• Supervised Pre-training: Sometimes it’s hard to directly train to solve for a specific task.
Instead it might be better to train for solving a simpler task and use that as an initialization
point for training to solve the more challenging task. As an intuition for why this seems
logical, consider that you didn’t have any background in integration and were asked to learn
how to compute the following integral:


If you’re anything close to a normal person, your first reaction would be bewilderment. However, wouldn’t it be better if you were first asked to understand the more basic integrals:

I hope you understand what I meant with this example: learning a simpler task puts you in a better position to tackle the more complex one. This particular strategy of training to solve a simpler task before facing the herculean one is called pretraining. A particular type of pretraining, called greedy supervised pretraining, first breaks a given supervised learning problem into simpler supervised learning problems and solves for the optimal version of each component in isolation. To build on the above intuition, the hypothesis as to why this works is that it gives better guidance to the intermediate layers of the network and helps with both generalization and optimization. More often than not, the greedy pretraining is followed by a fine-tuning stage where all the parts are jointly optimized to search for the optimal solution to the full problem. As an example, the figure below shows how each hidden layer is trained one at a time, where the input to the hidden layer being learned is the output of the previously trained hidden layer.


Greedy supervised pretraining (a) The first hidden layer is being trained only using the
original inputs and outputs. (b) For training the second hidden layer, the hidden-output connection
from the first hidden layer is removed and the output of the first hidden layer is used as the input.

Also, FitNets shows an alternative way to guide the training process. Deep networks are hard to train mainly because the deeper the model gets, the more non-linearities are introduced. The authors propose the use of a shallower and wider teacher network that is trained first. Then a second network, which is thinner and deeper and called the student network, is trained to predict not only the final outputs but also the intermediate layers of the teacher network. For those who might not be clear about what deep, shallow, wide and thin mean, refer to the following diagram:

Explanation of the terms “shallow”, “deep”, “thin” and “wide” in the context of neural
networks.

The idea is that predicting the intermediate layers of the teacher network provides some
hints as to how the layers of the student network should be used and aids the optimization procedure.
It was shown that without the hints to the hidden layers, the student network performs poorly on both the training and the test data.

• Designing Models to Aid Optimization: Most of the work in deep learning has gone towards making models easier to optimize rather than designing a more powerful optimization algorithm. Modern networks therefore favor components that behave nearly linearly between the parameters and the output (for example, ReLU activations and skip connections). Linear functions increase consistently in a particular direction, so if there is an error, there is a clear direction in which the output should move to minimize it.

APPLICATIONS
1. LARGE-SCALE DEEP LEARNING
• Philosophy of connectionism
– While an individual neuron/feature is not intelligent, a large number acting together
can exhibit intelligent behavior
– The number of neurons must be large
• Although network sizes have increased exponentially over three decades, ANNs are still only
about as large as the nervous systems of insects
• Since size is important, DL requires high-performance hardware and software infrastructure

2. COMPUTER VISION

• Computer Vision is one of the most active areas for deep learning research, since
– Vision is a task that is effortless for humans but difficult for computers
• Standard benchmarks for deep learning algorithms include:
– object recognition
– OCR
• Computer vision requires relatively little preprocessing
– Pixel range
• Images should be standardized so that pixels lie in the same range, e.g. [0,1], [-1,1], or [0,255] (see the sketch after this list)
– Picture size
• Some architectures need a standard input size, so images may need to be scaled
• This may not be needed with convolutional models, which can dynamically adjust the size of their pooling regions
– Data set augmentation
• Can be seen as a preprocessing step for the training set
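A minimal sketch of the pixel-range standardization mentioned above (function name and target range are illustrative assumptions):

import numpy as np

def standardize_pixels(images, target_range=(0.0, 1.0)):
    # Scale uint8 images (values 0-255) into a common floating-point range.
    lo, hi = target_range
    return lo + (hi - lo) * images.astype(np.float32) / 255.0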

3. AUTOMATIC SPEECH RECOGNITION

Large-scale automatic speech recognition is the first and most convincing successful case
of deep learning. LSTM RNNs can learn "Very Deep Learning" tasks that involve multi-second
intervals containing speech events separated by thousands of discrete time steps, where one time step corresponds to about 10 ms. LSTM with forget gates is competitive with traditional speech
recognizers on certain tasks.
The initial success in speech recognition came on small-scale recognition tasks based on TIMIT. The data set contains 630 speakers from eight major dialects of American English, where each speaker reads 10 sentences. Its small size lets many configurations be tried. More importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence recognition, allows weak phone bigram language models. This lets the strength of the acoustic modeling aspects of speech recognition be analyzed more easily. Error rates on this task, measured as percent phone error rate (PER), have been tracked since 1991.
The debut of DNNs for speaker recognition in the late 1990s, for speech recognition around 2009-2011, and of LSTMs around 2003-2007 accelerated progress in eight major areas:

• Scale-up/out and accelerated DNN training and decoding


• Sequence discriminative training
• Feature processing by deep models with solid understanding of the underlying mechanisms
• Adaptation of DNNs and related deep models
• Multi-task and transfer learning by DNNs and related deep models
• CNNs and how to design them to best exploit domain knowledge of speech
• RNN and its rich LSTM variants
• Other types of deep models including tensor-based models and integrated deep
generative/discriminative models.
All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype
Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range
of Nuance speech products, etc.) are based on deep learning.

4. NATURAL LANGUAGE PROCESSING (NLP)

Understanding the complexities associated with language, whether it is syntax, semantics, tonal nuances, expressions, or even sarcasm, is one of the hardest tasks for machines to learn. Constant training since birth and exposure to different social settings help humans develop appropriate responses and a personalized form of expression for every scenario. Natural Language Processing through deep learning tries to achieve the same thing by training machines to catch linguistic nuances and frame appropriate responses. Document summarization is being widely used and tested in the legal sphere, automating much of the routine work of paralegals. Question answering, language modeling, text classification, Twitter analysis, and sentiment analysis at a broader level are all subsets of natural language processing where deep learning is gaining momentum. Earlier, logistic regression or SVMs were used to build time-consuming, complex models, but now distributed representations, convolutional neural networks, recurrent and recursive neural networks, reinforcement learning, and memory-augmenting strategies are helping achieve greater maturity in NLP. Distributed representations are particularly effective in producing linear semantic relationships used to build phrases and sentences and in capturing local word semantics with word embeddings (a word embedding defines the meaning of a word in terms of its neighboring words).
