
Unit 2


2 NEURAL NETWORKS

Unit Objectives

Introduction
Learning Outcomes

2.1 Artificial Neural Network

2.2 Single Layer Neural Network

2.3 Gradient Descent

2.4 Gradient Descent extension and regularization

2.5 Multilayer Perceptron

2.6 Backpropagation

2.7 Chain rule

2.8 Deep learning model

2.9 Keywords
2.10 Summary

UNIT OBJECTIVES

After studying this unit, you will be able to:


• Explain what a neural network is and outline its history
• Understand gradient descent and its types
• Describe gradient descent extensions and regularization
• Elaborate on the multilayer perceptron

INTRODUCTION

In this unit we will discuss neural networks. Neural networks are
parallel computing devices: in essence, an attempt to make a
computer model of the brain. The main objective is to develop a system that
performs various computational tasks faster than traditional systems.
These tasks include pattern recognition and classification, approximation,
optimization, and data clustering.
We will also discuss single-layer and multilayer neural networks. Further,
the different types of deep learning models and their usage are explained in
this unit.

LEARNING OUTCOMES

The content and assessments of this unit have been developed to achieve
the following learning outcomes:
• Understand the role of neural network in deep learning
• Understand single layer and multilayer neural network
• Understand how chain rule can be applied
• Understand the various types of supervised and unsupervised
deep learning models

2.1 ARTIFICIAL NEURAL NETWORK

ANN is an efficient computing system whose central theme is borrowed from the analogy of
biological neural networks. ANNs are also known as “artificial neural systems,” “parallel
distributed processing systems,” or “connectionist systems.” An ANN comprises a large
collection of units that are interconnected in some pattern to allow communication between
the units. These units, also referred to as nodes or neurons, are simple processors which
operate in parallel.

Every neuron is connected to other neurons through a connection link. Each connection link
is associated with a weight that has information about the input signal. This is the most useful
information for neurons to solve a particular problem because the weight usually excites or
inhibits the signal that is being communicated. Each neuron has an internal state, which is
called an activation signal. Output signals, which are produced after combining the input
signals and activation rule, may be sent to other units.

A Brief History of ANN


The history of ANN can be divided into the following three eras −
ANN during 1940s to 1960s

Some key developments of this era are as follows −


1943 − The concept of neural networks is generally considered to have started with the work
of physiologist Warren McCulloch and mathematician Walter Pitts, who in 1943 modeled a
simple neural network using electrical circuits in order to describe how neurons in the brain
might work.
1949 − Donald Hebb’s book, The Organization of Behavior, put forth the idea that repeated
activation of one neuron by another increases the strength of the connection between them
each time they are used.
1956 − An associative memory network was introduced by Taylor.
1958 − A learning method for McCulloch and Pitts neuron model named Perceptron was
invented by Rosenblatt.
1960 − Bernard Widrow and Marcian Hoff developed models called "ADALINE" and
“MADALINE.”
ANN during 1960s to 1980s

Some key developments of this era are as follows −
1961 − Rosenblatt proposed a “backpropagation” scheme for multilayer networks, though his
attempt was unsuccessful.
1964 − Taylor constructed a winner-take-all circuit with inhibitions among output units.

1969 − The multilayer perceptron (MLP) was invented by Minsky and Papert.

1971 − Kohonen developed Associative memories.


1976 − Stephen Grossberg and Gail Carpenter developed Adaptive resonance theory.
ANN from 1980s till Present

Some key developments of this era are as follows −


1982 − The major development was Hopfield’s Energy approach.
1985 − Boltzmann machine was developed by Ackley, Hinton, and Sejnowski.
1986 − Rumelhart, Hinton, and Williams introduced Generalised Delta Rule.
1988 − Kosko developed Binary Associative Memory and also gave the concept of Fuzzy Logic
in ANN.
The historical review shows that significant progress has been made in this field. Neural
network based chips are emerging and applications to complex problems are being developed.
Surely, today is a period of transition for neural network technology.

Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information. According to one
estimate, the brain contains a huge number of neurons, approximately 10^11, with numerous
interconnections, approximately 10^15.
Schematic Diagram

Fig 2.1 Biological neuron
Working of a Biological Neuron
As shown in the above diagram, a typical neuron consists of the following four parts with the
help of which we can explain its working −
Dendrites − They are tree-like branches, responsible for receiving information from the other
neurons the neuron is connected to. In a sense, they act as the ears of the neuron.
Soma − It is the cell body of the neuron and is responsible for processing the information
received from the dendrites.
Axon − It is just like a cable through which the neuron sends information.
Synapses − They are the connections between the axon of one neuron and the dendrites of
other neurons.

2.2 SINGLE LAYER NEURAL NETWORK

There are two types of architecture, distinguished by the functionality of the artificial neural
network:
Single Layer Perceptron
Multi-Layer Perceptron
Single Layer Perceptron
The single-layer perceptron was the first neural network model, proposed in 1958 by Frank
Rosenblatt. It is one of the earliest models for learning. Our goal is to find a linear decision
function defined by the weight vector w and the bias parameter b.
To understand the perceptron layer, it is necessary to comprehend artificial neural networks
(ANNs).
The artificial neural network (ANN) is an information processing system, whose mechanism is
inspired by the functionality of biological neural circuits. An artificial neural network consists
of several processing units that are interconnected.
This was the first neural model proposed. The neuron's local memory holds a vector of
weights. The output of the single-layer perceptron is computed by summing the input vector,
with each component multiplied by its corresponding weight, plus the bias. The resulting
value is then fed as the input to an activation function, which produces the displayed output.
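
As a concrete illustration, here is a minimal sketch of this computation together with Rosenblatt's learning rule in Python/NumPy (the AND dataset, learning rate, and epoch count are illustrative assumptions, not part of the original model description):

import numpy as np

def perceptron_output(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through a step activation
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical training data: the logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weight vector w
b = 0.0           # bias parameter b
lr = 0.1          # learning rate (assumed value)

for epoch in range(10):
    for xi, target in zip(X, y):
        error = target - perceptron_output(xi, w, b)
        w += lr * error * xi   # perceptron learning rule
        b += lr * error

print(w, b)   # parameters of the learned linear decision function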

2.3 GRADIENT DESCENT

Gradient Descent is known as one of the most commonly used optimization algorithms to train
machine learning models by means of minimizing errors between actual and expected results.
Further, gradient descent is also used to train Neural Networks.
In mathematical terminology, Optimization algorithm refers to the task of
minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in machine
learning, optimization is the task of minimizing the cost function parameterized by the model's
parameters. The main objective of gradient descent is to minimize the convex function using
iteration of parameter updates. Once these machine learning models are optimized, these
models can be used as powerful tools for Artificial Intelligence and various computer science
applications.
Gradient descent was initially proposed by Augustin-Louis Cauchy in the mid-19th century.
Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models. It
helps in finding the local minimum of a function.
The best way to define the local minimum or local maximum of a function using gradient
descent is as follows:
If we move towards a negative gradient or away from the gradient of the function at the
current point, it will give the local minimum of that function.
Whenever we move towards a positive gradient or towards the gradient of the function at the
current point, we will get the local maximum of that function.

Fig 2.2 Gradient Descent

Moving against the gradient in this way is known as gradient descent, also called steepest
descent (its mirror image, moving with the gradient, is gradient ascent). The main objective
of using a gradient descent algorithm is to minimize the cost function using iteration. To
achieve this goal, it performs two steps iteratively:
Calculate the first-order derivative of the function to compute the gradient or slope of that
function.
Move in the direction opposite to the gradient, stepping from the current point by alpha
times the gradient, where alpha is the learning rate. It is a tuning parameter in the
optimization process which helps decide the length of the steps.
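
As a minimal sketch of these two steps in Python, consider minimizing the convex function f(x) = (x − 3)², whose minimum is at x = 3 (the function, starting point, and learning rate are assumptions chosen for illustration):

def gradient(x):
    # First-order derivative of f(x) = (x - 3)^2
    return 2 * (x - 3)

x = 0.0        # arbitrary starting point
alpha = 0.1    # learning rate (assumed)

for step in range(100):
    x = x - alpha * gradient(x)   # move opposite to the gradient

print(x)   # converges toward the minimum at x = 3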
Cost-function
The cost function is defined as the measurement of the difference, or error, between actual
values and expected values at the current position, expressed as a single real number. It
improves machine learning efficiency by providing feedback to the model so that it can
minimize error and find the local or global minimum. The model continuously iterates along
the direction of the negative gradient until the cost function approaches zero. At that point,
the model stops learning further. Although cost function and loss function are often treated
as synonymous, there is a minor difference between them: the loss function refers to the
error of one training example, while the cost function calculates the average error across the
entire training set.
The cost function is evaluated after making a hypothesis with initial parameters, and those
parameters are then modified using the gradient descent algorithm over known data to
reduce the cost function.
Before examining the working principle of gradient descent, we should recall some basic
concepts for finding the slope of a line in linear regression. The equation for simple linear
regression is given as:
Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
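
As an illustration, the cost of the line Y = mX + c can be computed as a mean squared error over a training set; a sketch with hypothetical data values:

import numpy as np

def cost(m, c, X, Y):
    # Average squared error between the predictions m*X + c and the targets Y
    predictions = m * X + c
    return np.mean((Y - predictions) ** 2)

X = np.array([1.0, 2.0, 3.0])   # hypothetical inputs
Y = np.array([2.0, 4.0, 6.0])   # hypothetical targets
print(cost(2.0, 0.0, X, Y))     # 0.0, since Y = 2X fits this data exactly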

Fig 2.3 Cost-function
The starting point (shown in the above figure) is used to evaluate performance, as it is just an
arbitrary point. At this starting point, we take the first derivative, or slope, and use a tangent
line to measure its steepness. This slope then informs the updates to the parameters (weights
and bias).
The slope is steep at the starting (arbitrary) point, but as new parameters are generated the
steepness gradually reduces, until the curve reaches its lowest point, which is called the point
of convergence.
The main objective of gradient descent is to minimize the cost function, the error between
expected and actual values. To minimize the cost function, two data points are required:

Direction & Learning Rate

These two factors determine the partial derivative calculation of future iterations and allow
the algorithm to reach the point of convergence (a local or global minimum). Let's discuss the
learning rate factor in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a
small value that is evaluated and updated based on the behavior of the cost function. If the
learning rate is high, it results in larger steps but also risks overshooting the minimum. A low
learning rate takes small step sizes, which compromises overall efficiency but gives the
advantage of more precision.

Fig 2.4 Learning Rate

Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:
1. Batch Gradient Descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and
update the model after evaluating all training examples. This procedure is known as the
training epoch. In simple words, it is a greedy approach where we have to sum over all
examples for each update.
Advantages of Batch gradient descent:
It produces less noise in comparison to other types of gradient descent.
It produces stable gradient descent convergence.
It is computationally efficient, as all resources are used across all training samples.
2. Stochastic gradient descent
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training
example per iteration. In other words, within each training epoch it updates the parameters
for each example in the dataset, one at a time. As it requires only one training example at a
time, it is easier to store in allocated memory. However, it loses some computational
efficiency compared to batch gradient descent, because its frequent updates cost more
computation overall. Due to these frequent updates, the gradient is also noisy; however, this
noise can sometimes help in finding the global minimum and in escaping local minima.
Advantages of Stochastic gradient descent:

In Stochastic gradient descent (SGD), learning happens on every example, and it has a few
advantages over other types of gradient descent.
It is easier to allocate in desired memory.
It is relatively fast to compute than batch gradient descent.
It is more efficient for large datasets.
3. Mini-Batch Gradient Descent:
Mini-batch gradient descent is the combination of batch gradient descent and stochastic
gradient descent. It divides the training dataset into small batches and then performs an
update for each batch. Splitting the training dataset into smaller batches strikes a balance
between the computational efficiency of batch gradient descent and the speed of stochastic
gradient descent. Hence, we achieve a form of gradient descent with higher computational
efficiency and a less noisy gradient.
Advantages of Mini Batch gradient descent:
It is easier to fit in allocated memory.
It is computationally efficient.
It produces stable gradient descent convergence.
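
The three variants differ only in how many examples contribute to each parameter update. A sketch on a hypothetical linear regression problem (the data, learning rate, and batch size are assumptions; batch_size = len(X) reduces it to batch gradient descent, and batch_size = 1 to SGD):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # hypothetical features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # hypothetical targets

w = np.zeros(3)
lr = 0.1
batch_size = 16   # mini-batch size (assumed)

for epoch in range(50):
    idx = rng.permutation(len(X))             # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # Gradient of the mean squared error over this batch only
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
        w -= lr * grad                        # one update per mini-batch

print(w)   # approaches true_w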
Challenges with the Gradient Descent
Although Gradient Descent is one of the most popular methods for optimization problems, it
still has some challenges, as follows:
1. Local Minima and Saddle Point:
For convex problems, gradient descent can find the global minimum easily, while for non-
convex problems, it is sometimes difficult to find the global minimum, where the machine
learning models achieve the best results.

Fig 2.6 Local Minima and Saddle Point
Whenever the slope of the cost function is at or close to zero, the model stops learning
further. Apart from the global minimum, there are other scenarios that can show this slope:
saddle points and local minima. A local minimum has a shape similar to the global minimum,
in that the slope of the cost function increases on both sides of the current point.
In contrast, at a saddle point the gradient reaches a local maximum on one side of the point
and a local minimum on the other. The saddle point takes its name from a horse's saddle.
A local minimum is named so because the value of the loss function is minimal at that point
within a local region. In contrast, the global minimum is named so because the value of the
loss function is minimal there globally, across the entire domain of the loss function.
2. Vanishing and Exploding Gradient
In a deep neural network, if the model is trained with gradient descent and backpropagation,
two more issues can arise besides local minima and saddle points.
Vanishing Gradients:
Vanishing Gradient occurs when the gradient is smaller than expected. During
backpropagation, the gradient becomes progressively smaller, causing the earlier layers of
the network to learn more slowly than the later layers. When this happens, the weight
updates become so small as to be insignificant.

Exploding Gradient:
Exploding gradient is the opposite of the vanishing gradient: it occurs when the gradient is
too large, creating an unstable model. In this scenario, the model weights grow too large and
may eventually be represented as NaN. This problem can be addressed using dimensionality
reduction techniques, which help to minimize complexity within the model.

2.4 GRADIENT DESCENT EXTENSION AND REGULARIZATION

In the previous topic, we have studied three methods to implement back-propagation in Deep
Learning models:
Gradient Descent
Stochastic Gradient Descent
Mini-Batch Stochastic Gradient Descent
Of these, we keep mini-batch gradient descent, because it allows for greater speed (it does
not have to calculate gradients and errors for the entire dataset) and eliminates the high
variability that exists in Stochastic Gradient Descent.
Well, there are improvements over these methods, such as Momentum. Besides, there are
other more complex algorithms such as Adam, RMSProp or Adagrad.
Momentum
Imagine being a kid again and having the great idea of putting on your skates, climbing up the
steepest street and starting to go down it. You are a total beginner, and this is only the second
time you have worn skates.
I don’t know if any of you have ever really done this, but well, I have, so let me explain what
happens:
You just start, the speed is small, you even seem to be in control and you could stop at any
time.
But the lower you go, the faster you move: this is called momentum.
The more road you go down, the more inertia you carry and the faster you go.
Well, for those of you who are curious, the end of the story is that at the end of the steep
street there is a fence.
Well, the Momentum technique is precisely this. As we go down our loss curve, calculating
the gradients and making the updates, we give more importance to the updates that go in the
direction that minimizes the loss, and less importance to those that go in other directions.

Fig 2.7 Momentum
So, the result is to speed up the training of the network.
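
In update-rule form, momentum keeps a running velocity that accumulates past gradients. A runnable sketch on a simple quadratic loss (the loss function, learning rate, and momentum coefficient of 0.9 are illustrative assumptions):

def grad(w):
    # Gradient of the illustrative loss f(w) = (w - 5)^2
    return 2 * (w - 5.0)

w, v = 0.0, 0.0
lr, momentum = 0.01, 0.9   # assumed hyperparameters

for step in range(200):
    v = momentum * v - lr * grad(w)   # velocity: the "inertia" we carry
    w = w + v                         # updates in a persistent direction grow

print(w)   # approaches the minimum at w = 5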
Nesterov Momentum
Going back to the example of before: we are going down the road at full speed (because we
have built a lot of momentum) and suddenly we see the end of it. We would like to be able to
brake, to slow down to avoid crashing. Well, this is precisely what Nesterov does.
Nesterov calculates the gradient, but instead of doing so at the current point, it does so at the
point where we know our momentum is going to take us, and then applies a correction.

Fig 2.8 Nesterov Momentum


Notice that using standard momentum, we calculate the gradient (small orange vector) and
then take a big step in the direction of the gradient (large orange vector).
Using Nesterov, we would first make a big jump in the direction of our previous gradient
(green vector), measure the gradient, and then make the appropriate correction (red vector).
In practice, it works a little better than momentum alone. It's like calculating the gradient of
the weights in the future (because we have added the momentum we had already calculated).
Both Nesterov’s momentum and the standard momentum are extensions of the SGD.
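
In the same notation as the momentum sketch above, Nesterov evaluates the gradient at the look-ahead point the momentum is about to carry us to, and then corrects (same illustrative loss and assumed hyperparameters):

def grad(w):
    return 2 * (w - 5.0)   # same illustrative quadratic loss as before

w, v = 0.0, 0.0
lr, momentum = 0.01, 0.9

for step in range(200):
    lookahead = w + momentum * v              # where momentum will take us
    v = momentum * v - lr * grad(lookahead)   # gradient measured there
    w = w + v                                 # corrected step

print(w)   # converges to 5, typically a little more smoothly than momentum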
The methods that we are going to see now are based on adaptive learning rates, allowing us
to accelerate or slow down the speed with which we update the weights. For example, we
could use a high speed at the beginning, and lower it as we approach the minimum.
Adaptive gradient (AdaGrad)
It keeps a history of the calculated gradients (in particular, of the sum of the squared gradients)
and normalizes the “step” of the update.

The intuition behind it is that it identifies the parameters with a very high gradient, whose
weight updates would otherwise be very abrupt, and assigns them a lower learning rate to
mitigate this abruptness.
At the same time, parameters with a very low gradient are assigned a higher learning rate.
In this way, we manage to accelerate the convergence of the algorithm.
RMSprop
The problem with AdaGrad is that the sum of the squared gradients is a monotonically
increasing quantity; because it never stops growing, the effective learning rate shrinks
toward zero, at which point learning stops.
What RMSprop proposes is to decrease that sum of the squared gradients using a decay_rate.
Adam
Finally, Adam is one of the most modern algorithms, which improves RMSprop by adding
momentum to the update rule. It introduces 2 new parameters, beta1 and beta2, with
recommended values of 0.9 and 0.999.
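
A sketch contrasting the three adaptive update rules on a single parameter, using the same illustrative quadratic loss as before (beta1 = 0.9 and beta2 = 0.999 follow the recommendations above; the other hyperparameter values are assumed):

import numpy as np

def grad(w):
    return 2 * (w - 5.0)   # illustrative quadratic loss

eps = 1e-8   # small constant to avoid division by zero

# AdaGrad: the sum of squared gradients only ever grows, so steps keep shrinking
w, cache, lr = 0.0, 0.0, 0.5
for t in range(200):
    g = grad(w)
    cache += g ** 2
    w -= lr * g / (np.sqrt(cache) + eps)

# RMSprop: a decaying average of squared gradients keeps the step size alive
w, cache, decay_rate, lr = 0.0, 0.0, 0.9, 0.5
for t in range(200):
    g = grad(w)
    cache = decay_rate * cache + (1 - decay_rate) * g ** 2
    w -= lr * g / (np.sqrt(cache) + eps)

# Adam: RMSprop plus momentum, with bias correction for the early steps
w, m, v = 0.0, 0.0, 0.0
beta1, beta2, lr = 0.9, 0.999, 0.1
for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment (RMSprop term)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # the Adam run ends near the minimum at 5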
Gradient Descent Regularization
Gradient descent seeks to find a local minimum of the cost function by adjusting model
parameters. The cost function (or loss function) maps variables onto a real number
representing a “cost” or value to be minimized.
For our model optimization, we’ll perform least squares optimization, where we seek to
minimize the sum of the squared differences between our predicted values and the data
values.
Regularization is a process of introducing additional information in order to solve an ill-posed
problem or to prevent overfitting.
Cost Function of Ordinary Least Squares (OLS) Regression

Cost = Σ_{i=1..N} ( y_i − f(x_i) )²

N — number of samples
p — number of independent variables or features
x — feature
y — actual target or dependent variable
f(x) — estimated target
β — coefficient or weight corresponding to each feature or independent variable
Regularization Parameter ‘λ’: Since both variance and bias are functions of the coefficients
β1, β2, …, they are directly proportional to them; on its own, this will not work. Hence we
need an additional parameter that can regulate the size of the penalty (bias) term. This
regulator is the regularization parameter ‘λ’, giving, for example, the ridge-regularized cost
Cost = Σ_{i=1..N} ( y_i − f(x_i) )² + λ·Σ_j β_j².
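
A sketch of gradient descent on the OLS cost with this L2 (ridge) penalty, showing how λ shrinks the coefficients (the data and the value of λ are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                        # hypothetical features
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=50)

beta = np.zeros(2)
lr, lam = 0.1, 0.5   # learning rate and regularization parameter (assumed)

for step in range(500):
    residual = X @ beta - y
    # Gradient of the mean squared error plus the L2 penalty lam * ||beta||^2
    grad = 2 * X.T @ residual / len(y) + 2 * lam * beta
    beta -= lr * grad

print(beta)   # coefficients shrunk toward zero relative to plain OLS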

2.5 MULTILAYER PERCEPTRON

Multi-layer perceptron defines the most complex architecture of artificial neural networks. It
is substantially formed from multiple layers of perceptrons.

The pictorial representation of multi-layer perceptron learning is shown below:

Fig 2.9 Multi-Layer perceptron

MLP networks are used in a supervised learning format. A typical learning algorithm for MLP
networks is the backpropagation algorithm.

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set
of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected
as a directed graph between the input and output layers. MLP uses backpropagation for
training the network. MLP is a deep learning method.

2.6 BACKPROPAGATION

Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
Backpropagation is short for “backward propagation of errors.” It is a standard method of
training artificial neural networks. This method helps calculate the gradient of a loss function
with respect to all the weights in the network.
How Backpropagation Algorithm Works
The backpropagation algorithm computes the gradient of the loss function with respect to a
single weight via the chain rule. It efficiently computes one layer at a time, unlike a naive
direct computation. It computes the gradient, but it does not define how the gradient is used.
It generalizes the computation in the delta rule.
Consider the following backpropagation neural network example diagram:

Fig 2.10 Backpropagation
Inputs X arrive through the preconnected path.
The input is modeled using real weights W; the weights are usually randomly selected.
Calculate the output for every neuron from the input layer, to the hidden layers, to the output
layer.
Calculate the error in the outputs:
Error = Actual Output − Desired Output
Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased.
Keep repeating the process until the desired output is achieved.
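
Followed literally, these steps give a complete training loop. A minimal sketch for one hidden layer on a hypothetical XOR dataset (the layer sizes, sigmoid activation, learning rate, and epoch count are all illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs X
y = np.array([[0], [1], [1], [0]], dtype=float)               # desired output

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # weights randomly selected
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5   # learning rate (assumed)

for epoch in range(5000):
    # Forward pass: compute the output of every neuron, layer by layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error in the outputs
    error = out - y
    # Travel back from the output layer to the hidden layer, adjusting weights
    d_out = error * out * (1 - out)        # chain rule through the sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))   # typically approaches the desired XOR outputs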
Need of Backpropagation
Most prominent advantages of Backpropagation are:
Backpropagation is fast, simple and easy to program
It has no parameters to tune apart from the number of inputs
It is a flexible method as it does not require prior knowledge about the network
It is a standard method that generally works well
It does not need any special mention of the features of the function to be learned.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between these two methods is that the mapping is rapid in static back-
propagation, while it is non-static in recurrent backpropagation.
Disadvantages of using Backpropagation
The actual performance of backpropagation on a specific problem depends on the input data.
Backpropagation can be quite sensitive to noisy data.
A matrix-based approach is needed for backpropagation, rather than a mini-batch approach.

2.7 CHAIN RULE

The chain rule allows us to find the derivative of composite functions.

It is applied extensively by the backpropagation algorithm in order to train feedforward
neural networks. By applying the chain
rule in an efficient manner while following a specific order of operations, the backpropagation
algorithm calculates the error gradient of the loss function with respect to each weight of the
network. So, the chain rule allows us to find the derivative of a composite function.
Let’s first define how the chain rule differentiates a composite function, and then break it into
its separate components to understand it better. If we had to consider again the composite
function, h = g(f(x)), then its derivative as given by the chain rule is:

dh/dx = (dh/du) × (du/dx)

Here, u is the output of the inner function f (hence, u = f(x)), which is then fed as input to the
next function g to produce h (hence, h = g(u)). Notice, therefore, how the chain rule
relates the net output, h, to the input, x, through an intermediate variable, u.
We can summarise:
A composite function is the combination of two (or more) functions.
The chain rule allows us to find the derivative of a composite function.
The chain rule can be generalised to multivariate functions, and represented by a tree
diagram.
The chain rule is applied extensively by the backpropagation algorithm in order to calculate the
error gradient of the loss function with respect to each weight.
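
A small numeric check of the rule, taking f(x) = 3x + 1 as the inner function and g(u) = u² as the outer one (the functions and evaluation point are chosen purely for illustration):

def f(x): return 3 * x + 1   # inner function: du/dx = 3
def g(u): return u ** 2      # outer function: dh/du = 2u

x = 2.0
u = f(x)                     # intermediate variable u = 7
analytic = (2 * u) * 3       # chain rule: dh/dx = (dh/du) * (du/dx) = 42

# Compare against a finite-difference approximation of dh/dx
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
print(analytic, round(numeric, 3))   # both are approximately 42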

2.8 DEEP LEARNING MODEL

There are two types of model:


Supervised Models
Classic Neural Networks (Multilayer Perceptrons)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Unsupervised Models
Self-Organizing Maps (SOMs)
Boltzmann Machines
AutoEncoders
Supervised vs Unsupervised Models

There are a number of features that distinguish the two, but the most integral point of
difference is in how these models are trained. While supervised models are trained through
examples of a particular set of data, unsupervised models are only given input data and don’t
have a set outcome they can learn from. So that y-column that we’re always trying to predict is
not there in an unsupervised model. While supervised models have tasks such as regression and
classification and will produce a formula, unsupervised models have clustering and association
rule learning.

Fig 2.11 Supervised vs Unsupervised Models

Classic Neural Networks (Multilayer Perceptrons)

Fig 2.12 Classic Neural Networks (Multilayer Perceptrons)

Classic Neural Networks can also be referred to as Multilayer perceptrons. The perceptron
model was created in 1958 by American
psychologist Frank Rosenblatt. Its singular nature allows it to adapt to basic binary patterns
through a series of inputs, simulating the learning patterns of a human brain. A multilayer
perceptron is the classic neural network model, consisting of more than 2 layers.
When to use
Tabular dataset formatted in rows and columns (CSV files)
Classification and Regression problems where a set of real values is given as the input.
A higher level of flexibility is required in your model. ANNs can be applied to different types of
data.
Convolutional Neural Networks
A more capable and advanced variation of classic artificial neural networks, a Convolutional
Neural Network (CNN) is built to handle greater complexity in the pre-processing and
computation of data.
CNNs were designed for image data and might be the most efficient and flexible model for
image classification problems. Although CNNs were not particularly built to work with non-
image data, they can achieve stunning results with non-image data as well.
After you have imported your input data into the model, there are 4 parts to building the CNN:
1. Convolution: a process in which feature maps are created out of our input data. A function
is then applied to the feature maps.
2. Max-Pooling: enables our CNN to detect an image when presented with modification.
3. Flattening: Flatten the data into an array so CNN can read it.
4. Full Connection: The hidden layer, which also calculates the loss function for our model.
When to use
Image Datasets (including OCR document analysis).
Input data is a 2-dimensional field but can be converted to 1-dimensional internally for faster
processing.
When the model may require great complexity in calculating the output.
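
A minimal sketch of the four parts above in Keras (assuming TensorFlow is installed; the 28×28 grayscale input shape, filter counts, and 10-class output are hypothetical choices):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # 1. Convolution: create feature maps from the input, with ReLU applied
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # 2. Max-Pooling: keep the strongest responses, tolerating modifications
    layers.MaxPooling2D((2, 2)),
    # 3. Flattening: turn the feature maps into a 1-D array
    layers.Flatten(),
    # 4. Full Connection: hidden layer, then the output layer
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # 10 classes (assumed)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")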
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) were invented for predicting sequences.
LSTM (Long short-term memory) is a popular RNN algorithm with many possible use cases:

Fig 2.13 Recurrent Neural Networks

When to use:
One to one: a single input mapped to a single output.
e.g — Image Classification
One to many: a single input mapped to a sequence of outputs.
e.g — Image captioning (multiple words from a single image)
Many to one: A sequence of inputs produces a single output.
e.g — Sentiment Analysis (binary output from multiple words)
Many to many: A sequence of inputs produces a sequence of outputs.
e.g — Video classification (splitting the video into frames and labeling each frame separately)
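
For instance, a sketch of the many-to-one case (sentiment analysis) as a Keras LSTM, where a sequence of word indices produces a single binary output (the vocabulary size and layer widths are assumed values):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=32),   # vocab size assumed
    layers.LSTM(64),                         # reads the whole input sequence
    layers.Dense(1, activation="sigmoid"),   # single sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy")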
Self-Organizing Maps
Self-Organizing Maps, or SOMs, work with unsupervised data and usually help with
dimensionality reduction (reducing how many random variables you have in your model). The
output dimension is always 2-dimensional for a self-organizing map, so if we have more than 2
input features, the output is reduced to 2 dimensions. Each synapse connecting our input and
output nodes has a weight assigned to it. Then, each data point competes for representation
in the model. The closest node is called the BMU (best matching unit), and the SOM updates
its weights to move closer to the BMU. The neighborhood of the BMU keeps shrinking as the
model progresses, and the closer to the BMU a node is, the more its weights change.
Note: Weights are a characteristic of the node itself; they represent where the node lies in the
input space. There is no activation function here (weights play a different role from the one
they had in ANNs).
When to use:
When data provided does not contain an output or a Y column.
Exploration projects to understand the framework behind a dataset.
Creative projects (Music/Text/Video produced by AI).
Dimensionality reduction for feature detection.
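
A sketch of the competition and BMU update described above, in NumPy (the 4×4 grid, learning rate, and neighborhood radius are assumptions; in practice these are decayed over time, and libraries such as MiniSom implement the full algorithm):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))      # hypothetical 5-feature inputs
nodes = rng.normal(size=(4, 4, 5))    # 4x4 output grid; weights = positions

lr, radius = 0.5, 1.5                 # assumed values
for x in data:
    # Best matching unit: the node whose weights lie closest to the input
    dists = np.linalg.norm(nodes - x, axis=2)
    bi, bj = np.unravel_index(dists.argmin(), dists.shape)
    for i in range(4):
        for j in range(4):
            grid_dist = np.hypot(i - bi, j - bj)
            if grid_dist <= radius:   # neighbors of the BMU also move
                influence = np.exp(-grid_dist**2 / (2 * radius**2))
                nodes[i, j] += lr * influence * (x - nodes[i, j])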
Boltzmann Machines

In the 4 models above, there’s one thing in common: these models work in a certain direction.
Even though SOMs are unsupervised, they still work in a particular direction, just as
supervised models do. By direction, I mean:
Input → Hidden Layer → Output.

Fig 2.14 Boltzmann Machines


Boltzmann machines don’t follow a certain direction. All nodes are connected to each other in
a circular kind of hyperspace, as in the image.
A Boltzmann machine can also generate all parameters of the model, rather than working with
fixed input parameters.
Such a model is referred to as stochastic and is different from all the above deterministic
models. Restricted Boltzmann Machines are more practical.
When to use:
When monitoring a system (since the BM will learn to regulate)
Building a binary recommendation system
When working with a very specific set of data
AutoEncoders

Fig 2.15 AutoEncoders

Autoencoders work by automatically encoding data based on the input values, then applying
an activation function, and finally decoding the data for output. A bottleneck of some sort is
imposed on the input features, compressing them into fewer categories. Thus, if some
inherent structure exists within the data, the autoencoder model will identify and leverage it
to produce the output.
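
A minimal sketch of this encode-bottleneck-decode structure in Keras (the 784-feature input and layer sizes are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(784,)),   # encode
    layers.Dense(8, activation="relu"),       # bottleneck: fewer categories
    layers.Dense(32, activation="relu"),      # decode
    layers.Dense(784, activation="sigmoid"),  # reconstruct the input
])
autoencoder.compile(optimizer="adam", loss="mse")
# Trained to reproduce its own input, e.g. autoencoder.fit(X, X, epochs=10)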
Types/Variations of AutoEncoders:
- Sparse AutoEncoders: Where the hidden layer is larger than the input layer, but a
regularization technique is applied to reduce overfitting. It adds a constraint on the loss
function, preventing the autoencoder from using all of its nodes at a time.
- Denoising AutoEncoders: Another regularization technique, in which we take a modified
version of our input values with some of the input values randomly turned into 0.
- Contractive AutoEncoders: Adds a penalty to the loss function to prevent overfitting and
copying of values when the hidden layer is greater than the input layer.
- Stacked AutoEncoders: When you add another hidden layer, you get a stacked autoencoder.
It has 2 stages of encoding and 1 stage of decoding.
When to use:
Dimensionality reduction/Feature detection
Building powerful recommendation systems (more powerful than BM)
Encoding features in massive datasets

2.9 KEYWORDS

• Dendrites − Tree-like branches, responsible for receiving information from the other
neurons a neuron is connected to. In a sense, they act as the ears of the neuron.
• Soma − The cell body of the neuron, responsible for processing the information
received from the dendrites.
• Axon − Just like a cable, through which the neuron sends information.
• Synapses − The connections between the axon of one neuron and the dendrites of
other neurons.
• Gradient Descent – It is known as one of the most commonly used optimization
algorithms to train machine learning models by means of minimizing errors
between actual and expected results.

• Stochastic gradient descent (SGD) − It is a type of gradient descent that processes one
training example per iteration. In other words, within each training epoch it updates
the parameters for each example in the dataset, one at a time.

2.10 SUMMARY

• Neural networks are parallel computing devices: in essence, an attempt to make a
computer model of the brain. The main objective is to develop a system that performs
various computational tasks faster than traditional systems. These tasks include
pattern recognition and classification, approximation, optimization, and data
clustering.

• The cost function is defined as the measurement of the difference, or error, between
actual values and expected values at the current position, expressed as a single real
number.
• Based on the error in various training models, the Gradient Descent learning
algorithm can be divided into Batch gradient descent, stochastic gradient descent,
and mini-batch gradient descent.

• Regularization is a process of introducing additional information in order to solve an
ill-posed problem or to prevent overfitting.
• A multilayer perceptron (MLP) is a feedforward artificial neural network that
generates a set of outputs from a set of inputs. An MLP is characterized by several
layers of nodes connected as a directed graph between the input and output layers.

• Backpropagation is the essence of neural network training. It is the method of fine-
tuning the weights of a neural network based on the error rate obtained in the previous
epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and
make the model reliable by increasing its generalization.
• The chain rule is applied extensively by the backpropagation algorithm in order to
calculate the error gradient of the loss function with respect to each weight.
