


• The Basic Architecture of Neural Networks: Single Computational Layer
https://www.upgrad.com/blog/neural-network-architecture-components-algorithms/

Neural Networks are complex structures made of artificial


neurons that can take in multiple inputs to produce a single
output. This is the primary job of a Neural Network – to transform
input into a meaningful output. Usually, a Neural Network
consists of an input and output layer with one or multiple hidden
layers within.
In a Neural Network, all the neurons influence each other, and
hence, they are all connected.

• Single Layered Neural Network


https://www.geeksforgeeks.org/single-layered-neural-networks-in-r-programming/

A single-layered neural network, often called a perceptron, is a type of feed-forward neural network made up of input and output layers. The inputs provided are multi-dimensional. Perceptrons are acyclic in nature. The sum of the products of the weights and the inputs is calculated at each node. The input layer transmits the signals to the output layer, and the output layer performs the computation. A perceptron can learn only a linear function and requires comparatively little training effort. The output can be represented as one of two values (e.g., 0 or 1).
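To make the weighted-sum-and-threshold computation concrete, here is a minimal NumPy sketch of a perceptron trained with the classic perceptron update rule; the AND-gate data, learning rate and epoch count are illustrative assumptions, not part of the source above.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    # Single-layer perceptron: weights, bias, and mistake-driven updates.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0  # weighted sum -> threshold
            w += lr * (target - pred) * xi            # update only on mistakes
            b += lr * (target - pred)
    return w, b

# Linearly separable toy data (AND gate): output is one of two values.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)

Because the perceptron can only learn a linear decision boundary, this converges for the AND gate but would never converge for XOR.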

Advantages of Single-layered Neural Network


1. Single-layer neural networks are easy to set up and train because there are no hidden layers.
2. They have explicit links to statistical models.
Disadvantages of Single-layered Neural Network
1. They work well only for linearly separable data.
2. A single-layer neural network has lower accuracy compared to a multi-layer neural network.
• Multilayer Neural Networks
https://www.geeksforgeeks.org/multi-layered-neural-networks-in-r-programming/

To be precise, a fully connected multi-layered neural network is known as a Multi-Layer Perceptron (MLP). A multi-layered neural network consists of multiple layers of artificial neurons or nodes. Unlike the single-layer case, in recent times most networks are multi-layered. The following diagram is a visualization of a multi-layer neural network.

Here the nodes marked as “1” are known as bias units. The leftmost layer, Layer 1, is the input layer; the middle layer, Layer 2, is the hidden layer; and the rightmost layer, Layer 3, is the output layer. We can say that the above diagram has 3 input units (leaving out the bias unit), 1 output unit, and 3 hidden units.
A multi-layered neural network is the typical example of a feed-forward neural network. The number of neurons and the number of layers are hyperparameters of the neural network, and they need tuning. In order to find good values for the hyperparameters, one must use cross-validation techniques. Weight-adjustment training is carried out using the back-propagation technique.
• Historical Trends in Deep Learning
• Book 26
It is easiest to understand deep learning with some historical
context. Rather than
providing a detailed history of deep learning, we identify a few
key trends:
• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.
• Deep learning has become more useful as the amount of available training data has increased.
• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy over time.

• Deep Feedforward Networks


https://towardsdatascience.com/deep-learning-feedforward-neural-network-26a6705dbdc7

Deep feedforward networks, also often called feedforward neural


networks, or multilayer perceptrons (MLPs), are the quintessential
deep learning models. These models are called feedforward
because information flows through the function being evaluated
from x, through the intermediate computations used to define f,
and finally to the output y. There are no feedback connections in
which outputs of the model are fed back into itself.

• Generalization
https://www.kdnuggets.com/2019/11/generalization-neural-networks.html

Whenever we train our own neural networks, we need to take


care of something called the generalization of the neural network.
This essentially means how good our model is at learning from the
given data and applying the learnt information elsewhere.
When training a neural network, there’s going to be some data
that the neural network trains on, and there’s going to be some
data reserved for checking the performance of the neural
network. If the neural network performs well on the data which it
has not trained on, we can say it has generalized well on the given
data.

• Regularization
https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/

Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better. This in turn improves the model’s performance on unseen data as well. Regularization helps in reducing overfitting.
Different Regularization Techniques in Deep Learning

1. L2 & L1 regularization
L1 and L2 are the most common types of regularization. These
update the general cost function by adding another term
known as the regularization term.
Cost function = Loss (say, binary cross entropy) + Regularization
term
Due to the addition of this regularization term, the values of
weight matrices decrease because it assumes that a neural
network with smaller weight matrices leads to simpler models.
Therefore, it will also reduce overfitting to quite an extent.
However, this regularization term differs in L1 and L2.
In L2, we have:

Cost function = Loss + (λ / 2m) × Σ‖w‖²

Here, lambda (λ) is the regularization parameter: the hyperparameter whose value is tuned for better results, and m is the number of training inputs. L2 regularization is also known as weight decay, as it forces the weights to decay towards zero (but not exactly zero).
In L1, we have:

Cost function = Loss + (λ / 2m) × Σ‖w‖

2. Dropout
This is one of the most interesting types of regularization techniques, and it produces very good results; consequently, it is among the most frequently used regularization techniques. At every training iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections (see the sketch below).
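As a rough illustration of both techniques, here is a minimal NumPy sketch; the weight matrices, penalty strength lam and keep probability are illustrative assumptions rather than values from the sources above.

import numpy as np

def l2_penalty(weights, lam):
    # L2 regularization term: lam times the sum of squared weights (weight decay).
    return lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam):
    # L1 regularization term: lam times the sum of absolute weights (favors sparsity).
    return lam * sum(np.sum(np.abs(w)) for w in weights)

def dropout(activations, keep_prob=0.8, training=True):
    # Randomly remove nodes during training ("inverted dropout"); at test time
    # the layer is left unchanged thanks to the 1/keep_prob rescaling.
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

# Usage: cost = data_loss + l2_penalty([W1, W2], lam=1e-4)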

• Optimization for Training Deep Learning Models


https://www.kdnuggets.com/2020/12/optimization-algorithms-neural-networks.html

In deep learning, we have the concept of loss, which tells us how


poorly the model is performing at that current instant. Now we
need to use this loss to train our network such that it performs
better. Essentially what we need to do is to take the loss and try
to minimize it, because a lower loss means our model is going to
perform better. The process of minimizing (or maximizing) any
mathematical expression is called optimization.
Optimizers are algorithms or methods used to change the
attributes of the neural network such as weights and learning rate
to reduce the losses. Optimizers are used to solve optimization
problems by minimizing the function.

Different types of optimizers:


1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam

1. Gradient Descent is an optimization algorithm for


finding a local minimum of a differentiable function.
Gradient descent is simply used to find the values of a
function's parameters (coefficients) that minimize a
cost function as far as possible.
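A minimal sketch of this idea, assuming a known gradient function; f(x) = x² with gradient 2x, the learning rate and the step count are illustrative choices.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step against the gradient to approach a local minimum.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = x^2, whose gradient is 2x; the result approaches 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)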

• Deep Transfer Learning


https://towardsdatascience.com/what-is-deep-transfer-learning-and-why-is-it-becoming-so-popular-91acdcc2717a

Transfer learning is an approach in deep learning (and machine


learning) where knowledge is transferred from one model to
another.
Using transfer learning, we are able to solve a particular task using all or part of a model already pre-trained on a different task.
When to use transfer learning?
The reasons are:
1. Lack of data
Deep learning models require a LOT of data for solving a task
effectively. However, it is not often the case that so much data
is available. For example, a company may wish to build a very specific spam filter for its internal communication system but does not possess much labelled data.
In that case, a specific target task can be solved using a pre-
trained model for a similar source task.
2. Speed
Transfer learning cuts a large percentage of training time and allows building solutions quickly. In addition, it avoids the need to set up complex and costly cloud GPU/TPU infrastructure.
3. Social good
Using transfer learning positively impacts the environment.

• Deep Reinforcement Learning


https://bernardmarr.com/default.asp?contentID=1902

Deep reinforcement learning is a category of machine learning and


artificial intelligence where intelligent machines can learn from their
actions, similar to the way humans learn from experience. Inherent
in this type of machine learning is that an agent is rewarded or
penalised based on its actions. Actions that move it toward the target
outcome are rewarded.

https://en.wikipedia.org/wiki/Deep_reinforcement_learning

Deep reinforcement learning (deep RL) is a subfield of machine


learning that combines reinforcement learning (RL) and deep
learning. RL considers the problem of a computational agent learning
to make decisions by trial and error. Deep RL incorporates deep
learning into the solution, allowing agents to make decisions from
unstructured input data without manual engineering of the state
space. Deep RL algorithms are able to take in very large inputs and
decide what actions to perform to optimize an objective.

• Gradient Descent
https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/

Gradient Descent is an optimization algorithm used for


minimizing the cost function in various machine learning
algorithms. It is basically used for updating the parameters of
the learning model.
Types of Gradient Descent:
1. Batch Gradient Descent: This is a type of gradient
descent which processes all the training examples for
each iteration of gradient descent. But if the number of
training examples is large, then batch gradient
descent is computationally very expensive. Hence if
the number of training examples is large, then batch
gradient descent is not preferred. Instead, we prefer to
use stochastic gradient descent or mini-batch gradient
descent.
2. Stochastic Gradient Descent: This is a type of
gradient descent which processes 1 training example
per iteration. Hence, the parameters are updated
after each single example is processed, which makes
it much faster than batch gradient descent. But when
the number of training examples is large, processing
only one example at a time means the number of
iterations becomes very large, which adds overhead
for the system.
3. Mini-Batch Gradient Descent: This is a type of
gradient descent which works faster than both batch
gradient descent and stochastic gradient descent.
Here b examples, where b < m (m being the total
number of training examples), are processed per
iteration. So even if the number of training examples
is large, it is processed in batches of b training
examples in one go. Thus, it works for larger training
sets, with a smaller number of iterations, as the sketch
below illustrates.
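The three variants differ only in how many examples feed each update. A hedged NumPy sketch for a linear least-squares model (the data shapes, learning rate and loss are toy assumptions): setting b = m gives batch gradient descent, b = 1 gives stochastic gradient descent, and 1 < b < m gives mini-batch gradient descent.

import numpy as np

def minibatch_gd(X, y, b=16, lr=0.01, epochs=10):
    # X: (m, n) inputs, y: (m,) targets; mean-squared-error gradient per batch.
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)          # shuffle once per epoch
        for start in range(0, m, b):
            batch = idx[start:start + b]
            err = X[batch] @ w - y[batch]
            grad = 2 * X[batch].T @ err / len(batch)
            w -= lr * grad                      # one parameter update per batch
    return w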

• Introduction to Neural Network


A neural network is a group of connected I/O units where
each connection has an associated weight. It helps you to
build predictive models from large databases. This model is
built upon the human nervous system. It helps you to conduct
image understanding, human learning, computer speech, and
more.

• Multilayer Perceptron
https://en.wikipedia.org/wiki/Multilayer_perceptron

A multilayer perceptron (MLP) is a class of feedforward


artificial neural network (ANN). The term MLP is used
ambiguously: sometimes loosely to mean any feedforward ANN,
sometimes strictly to refer to networks composed of multiple
layers of perceptrons.
An MLP consists of at least three layers of nodes: an input
layer, a hidden layer and an output layer. Except for the
input nodes, each node is a neuron that uses a nonlinear
activation function. MLP utilizes a supervised learning
technique called backpropagation for training.

• Back-Propagation Learning


https://www.guru99.com/backpropogation-neural-network.html

Backpropagation is the essence of neural network training.


It is the method of fine-tuning the weights of a neural
network based on the error rate obtained in the previous
epoch (iteration). Proper tuning of the weights allows you to
reduce error rates and make the model reliable by
increasing its generalization.
Backpropagation in a neural network is short for
"backward propagation of errors." It is a standard method of
training artificial neural networks. This method helps
calculate the gradient of a loss function with respect to all
the weights in the network.

• Module No. 3

• CNN: Basic structure of CNN:

https://www.analyticsvidhya.com/blog/2020/10/what-is-the-convolutional-neural-network-architecture/

Convolution is a mathematical operation in which a filter slides along an input and computes weighted sums, producing a feature map.

A convolutional neural network (CNN/ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. It is also known as shift-invariant, based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps.
A filter is a set of weights in a matrix applied to an image or a matrix to obtain the required features.

Suppose the input is a 5×5 matrix and the filter is a 3×3 matrix. After applying the 3×3 kernel or filter to the 5×5 input, you obtain a 3×3 output feature map, as the sketch below illustrates.
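A hedged NumPy sketch of a "valid" convolution (implemented, as in most deep learning libraries, as cross-correlation), showing why a 5×5 input and a 3×3 filter give a 3×3 feature map; the random input and filter are placeholders.

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; output is (H - kH + 1) x (W - kW + 1).
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

feature_map = conv2d_valid(np.random.rand(5, 5), np.random.rand(3, 3))
print(feature_map.shape)  # (3, 3)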

Padding
While applying convolutions we will not obtain output dimensions the same as the input, and we will lose data over the borders, so we append a border of zeros and recalculate the convolution covering all the input values.
https://editor.analyticsvidhya.com/uploads/99433dnn4.gif

Striding
Sometimes we do not want to capture all the data or information available, so we skip some neighboring cells.
https://editor.analyticsvidhya.com/uploads/21732Screenshot(167).png

Pooling
In general terms, pooling refers to taking a small portion of the input. We either take the average value of that portion (average pooling) or the maximum value (max pooling). By pooling over an image we do not keep all the values; instead we take a summarized value over all the values present.
https://editor.analyticsvidhya.com/uploads/54575dnn6.png
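A minimal sketch of 2×2 max pooling with stride 2 (NumPy; a single-channel input is assumed, and odd trailing rows/columns are simply cropped):

import numpy as np

def max_pool2x2(x):
    # Summarize each non-overlapping 2x2 block by its maximum value.
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]   # crop to even dimensions
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

# Replacing .max with .mean over the same axes gives average pooling.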
Activation function
The activation function is a node that is put at the end of, or in between, neural network layers. It helps to decide whether the neuron would fire or not.

• Training a Convolutional Network

https://www.analyticsvidhya.com/blog/2021/05/convolutional-neural-networks-cnn/

Convolutional neural networks are composed of multiple


layers of artificial neurons. Artificial neurons, a rough imitation
of their biological counterparts, are mathematical functions
that calculate the weighted sum of multiple inputs and outputs
an activation value. When you input an image into a ConvNet,
each layer generates several activation maps that are
passed on to the next layer.
The first layer usually extracts basic features such as
horizontal or diagonal edges. This output is passed on to the
next layer which detects more complex features such as
corners or combinational edges. As we move deeper into the
network it can identify even more complex features such as
objects, faces, etc.
Based on the activation map of the final convolution layer, the
classification layer outputs a set of confidence scores (values
between 0 and 1) that specify how likely the image is to belong
to a “class.”

• https://victorzhou.com/blog/intro-to-cnns-part-2/

Training a neural network typically consists of two phases:

1. A forward phase, where the input is passed completely


through the network.
2. A backward phase, where gradients are backpropagated
(backprop) and weights are updated.
• During the forward phase, each layer will cache any data (like
inputs, intermediate values, etc) it’ll need for the backward
phase. This means that any backward phase must be
preceded by a corresponding forward phase.
• During the backward phase, each layer will receive a
gradient and also return a gradient. It receives the
gradient of the loss with respect to its outputs (∂L/∂out) and
returns the gradient of the loss with respect to its inputs (∂L/∂in).
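A sketch of this cache-and-return pattern for one fully connected layer (NumPy; the layer sizes, learning rate and upstream gradient are assumptions):

import numpy as np

class Dense:
    # A fully connected layer that caches its input during the forward phase.
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) * 0.01
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                       # cached for the backward phase
        return x @ self.W + self.b

    def backward(self, d_out, lr=0.01):
        # d_out is dL/dout received from the next layer.
        dW = self.x.T @ d_out            # gradient w.r.t. this layer's weights
        dx = d_out @ self.W.T            # dL/din, returned to the previous layer
        self.W -= lr * dW
        self.b -= lr * d_out.sum(axis=0)
        return dx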

• Applications of Convolutional Networks


https://www.flatworldsolutions.com/data-science/articles/7-applications-of-convolutional-neural-networks.php

1. Decoding Facial Recognition


2. Analyzing Documents
3. Historic and Environmental Collections
4. Understanding Climate
5. Grey Areas
6. Advertising

• Autoencoders
https://www.jeremyjordan.me/autoencoders/
An autoencoder is a neural network that is trained to attempt to
copy its input to its output. Internally, it has a hidden layer h that
describes a code used to represent the input. The network may
be viewed as consisting of two parts: an encoder function h = f(x)
and a decoder that produces a reconstruction r = g(h).
OR
Autoencoders are an unsupervised learning technique in which
we leverage neural networks for the task of representation
learning; the network imposes a bottleneck which forces a
compressed knowledge representation of the original input.
OR
An autoencoder is a neural network architecture capable of
discovering structure within data in order to develop a
compressed representation of the input

If the input features were each independent of one another, this


compression and subsequent reconstruction would be a very
difficult task. However, if some sort of structure exists in the data,
this structure can be learned and consequently leveraged when
forcing the input through the network's bottleneck.

https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-06-at-3.17.13-PM.png

Undercomplete autoencoder
One way to obtain useful features from the autoencoder is to
constrain h to have a smaller dimension than x. An autoencoder
whose code dimension is less than the input dimension is called
undercomplete. Learning an undercomplete representation
forces the autoencoder to capture the most salient features of the
training data.
The learning process is described simply as minimizing a loss
function
L(x, g(f(x)))

The simplest architecture for constructing an autoencoder is to


constrain the number of nodes present in the hidden layer(s) of
the network, limiting the amount of information that can flow
through the network. By penalizing the network according to the
reconstruction error, our model can learn the most important
attributes of the input data and how to best reconstruct the
original input from an "encoded" state. Ideally, this encoding will
learn and describe latent attributes of the input data.
Network diagram:
https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-07-at-8.24.37-AM.png
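A minimal PyTorch sketch of an undercomplete autoencoder with a constrained hidden layer; the 784-dimensional input, the 32-dimensional code and the layer sizes are illustrative assumptions.

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        # Encoder h = f(x): compress the input to a code smaller than x.
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                     nn.Linear(128, n_code))
        # Decoder r = g(h): reconstruct the input from the code.
        self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(),
                                     nn.Linear(128, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()  # reconstruction error L(x, g(f(x)))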
Because neural networks are capable of learning nonlinear
relationships, this can be thought of as a more powerful
(nonlinear) generalization of PCA. Whereas PCA attempts to
discover a lower dimensional hyperplane which describes the
original data, autoencoders are capable of learning nonlinear
manifolds.
An undercomplete autoencoder has no explicit regularization
term - we simply train our model according to the reconstruction
loss. Thus, our only way to ensure that the model isn't memorizing
the input data is the ensure that we've sufficiently restricted the
number of nodes in the hidden layer
For deep autoencoders, we must also be aware of the capacity of
our encoder and decoder models. Even if the "bottleneck layer" is
only one hidden node, it's still possible for our model to memorize
the training data provided that the encoder and decoder models
have sufficient capability to learn some arbitrary function which
can map the data to an index.
Given the fact that we'd like our model to discover latent
attributes within our data, it's important to ensure that the
autoencoder model is not simply learning an efficient way to
memorize the training data. Similar to supervised learning
problems, we can employ various forms of regularization to the
network in order to encourage good generalization properties;
these techniques are discussed below.

• Regularized Autoencoders
Undercomplete autoencoders, with code dimension less than the
input dimension, can learn the most salient features of the data
distribution. We have seen that these autoencoders fail to learn
anything useful if the encoder and decoder are given too much
capacity.
A similar problem occurs if the hidden code is allowed to have
dimension equal to the input, and in the overcomplete case in
which the hidden code has dimension greater than the input. In
these cases, even a linear encoder and linear decoder can learn to
copy the input to the output without learning anything useful
about the data distribution.
Ideally, one could train any architecture of autoencoder
successfully, choosing the code dimension and the capacity of the
encoder and decoder based on the complexity of distribution to
be modeled. Regularized autoencoders provide the ability to do
so. Rather than limiting the model capacity by keeping the
encoder and decoder shallow and the code size small, regularized
autoencoders use a loss function that encourages the model to
have other properties besides the ability to copy its input to its
output. These other properties include sparsity of the
representation, smallness of the derivative of the representation,
and robustness to noise or to missing inputs. A regularized
autoencoder can be nonlinear and overcomplete but still learn
something useful about the data distribution even if the model
capacity is great enough to learn a trivial identity function.

• Stochastic Encoders and Decoders


• Book 526
Autoencoders are just feedforward networks. The same loss
functions and output unit types that can be used for traditional
feedforward networks are also used for autoencoders.
A general strategy for designing the output units and the loss
function of a feedforward network is to define an output
distribution p(y | x) and minimize the negative log-likelihood −log
p(y | x). In that setting, y was a vector of targets, such as class
labels.
In the case of an autoencoder, x is now the target as well as the
input. However, given a hidden code h, we may think of the
decoder as providing a conditional distribution p_decoder(x | h). We
may then train the autoencoder by minimizing −log p_decoder(x | h).
The exact form of this loss function will change depending on the
form of p_decoder. As with traditional feedforward networks, we
usually use linear output units to parametrize the mean of a
Gaussian distribution if x is real-valued. In that case, the negative
log-likelihood yields a mean squared error criterion. Similarly,
binary x values correspond to a Bernoulli distribution whose
parameters are given by a sigmoid output unit, discrete x values
correspond to a softmax distribution, and so on.
Typically, the output variables are treated as being conditionally
independent given h so that this probability distribution is
inexpensive to evaluate, but some techniques such as mixture
density outputs allow tractable modeling of outputs with
correlations.

To make a more radical departure from the feedforward networks
we have seen previously, we can also generalize the notion of an
encoding function f(x) to an encoding distribution p_encoder(h | x), as
illustrated in the figure. Any latent variable model p_model(h, x)
defines a stochastic encoder.
In general, the encoder and decoder distributions are not
necessarily conditional distributions compatible with a unique
joint distribution p_model(x, h). Training the encoder and decoder
as a denoising autoencoder will tend to make them compatible
asymptotically.

• Denoising Autoencoders
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output. This is accomplished by minimizing the loss

L = −log p_decoder(x | h = f(x`))

where x` is a corrupted version of the data example x, obtained through a given corruption process C(x` | x), which represents a conditional distribution over corrupted samples x`, given a data sample x. Typically, the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g. The computational graph of the cost function is thus trained to reconstruct the clean data point x from its corrupted version x`.
The autoencoder learns a reconstruction distribution p_reconstruct(x | x`) estimated from training pairs (x, x`), as follows:
1. Sample a training example x from the training data.
2. Sample a corrupted version x` from C(x` | x).
3. Use (x, x`) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x`) = p_decoder(x | h), with h the output of the encoder f(x`) and p_decoder typically defined by a decoder g(h).

Typically we can simply perform gradient-based approximate minimization on the negative log-likelihood −log p_decoder(x | h). So long as the encoder is deterministic, the denoising autoencoder is a feedforward network and may be trained with exactly the same techniques as any other feedforward network.
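A sketch of one DAE training step where the corruption process C(x` | x) is assumed to be additive Gaussian noise; model can be any autoencoder (such as the PyTorch sketch earlier), and the noise scale is an illustrative choice.

import torch

def dae_step(model, x, optimizer, noise_std=0.3):
    # Corrupt x, reconstruct from the corrupted copy, compare to the CLEAN x.
    x_corrupted = x + noise_std * torch.randn_like(x)  # sample from C(x` | x)
    recon = model(x_corrupted)                         # g(f(x`))
    loss = torch.mean((recon - x) ** 2)                # target is the clean x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()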

Rather than adding a penalty Ω to the cost function, we can


obtain an autoencoder that learns something useful by
changing the reconstruction error term of the cost function.
Traditionally, autoencoders minimize some function
L(x, g(f (x)))
where L is a loss function penalizing g(f (x)) for being
dissimilar from x, such as the L2 norm of their difference.
This encourages g ∘ f to learn to be merely an identity
function if they have the capacity to do so.
A denoising autoencoder (DAE) instead minimizes
L(x, g(f(x`)))
where x` is a copy of x that has been corrupted by some form
of noise. Denoising autoencoders must therefore undo this
corruption rather than simply copying their input.
Denoising autoencoders thus provide yet another example
of how useful properties can emerge as a byproduct of
minimizing reconstruction error. They are also an example
of how overcomplete, high-capacity models may be used as
autoencoders so long as care is taken to prevent them from
learning the identity function.
• Contractive Autoencoders

https://www.i2tutorials.com/explain-about-the-contractive-autoencoders/

A contractive autoencoder is an unsupervised deep learning


technique that helps a neural network encode unlabeled
training data. The contractive autoencoder (CAE) objective is to
have a robust learned representation which is less sensitive
to small variations in the data.

Robustness of the representation is achieved by applying a penalty term to the loss function. The penalty term is the squared Frobenius norm of the Jacobian matrix of the hidden layer, calculated with respect to the input, i.e., the sum of squares of all the elements of the Jacobian.

The CAE surpasses results obtained by regularizing autoencoders using weight decay or by denoising, and is a better choice than a denoising autoencoder for learning useful feature extraction.

The penalty term generates mappings which strongly contract the data, hence the name contractive autoencoder.

Denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature-extraction function resist infinitesimal perturbations of the input.
• Predictive Sparse Decomposition
Book page 541
Predictive sparse decomposition (PSD) is a model that is a
hybrid of sparse coding and parametric autoencoders. A
parametric encoder is trained to predict the output of
iterative inference. PSD has been applied to unsupervised
feature learning for object recognition in images and video,
as well as for audio. The model consists of an encoder f(x)
and a decoder g(h) that are both parametric. During training,
h is controlled by the optimization algorithm. Training
proceeds by minimizing

‖x − g(h)‖² + λ|h|₁ + γ‖h − f(x)‖²
Like in sparse coding, the training algorithm alternates between minimization with respect to h and minimization with respect to the model parameters. Minimization with respect to h is fast because f(x) provides a good initial value of h and the cost function constrains h to remain near f(x) anyway. Simple gradient descent can obtain reasonable values of h in as few as ten steps.

The training procedure used by PSD is different from first training a sparse coding model and then training f(x) to predict the values of the sparse coding features. The PSD training procedure regularizes the decoder to use parameters for which f(x) can infer good code values.

Predictive sparse coding is an example of learned


approximate inference. PSD can be interpreted as training a
directed sparse coding probabilistic model by maximizing a
lower bound on the log-likelihood of the model.
In practical applications of PSD, the iterative optimization is
only used during training. The parametric encoder f is used
to compute the learned features when the model is
deployed. Evaluating f is computationally inexpensive
compared to inferring h via gradient descent. Because f is a
differentiable parametric function, PSD models may be
stacked and used to initialize a deep network to be trained
with another criterion.

• Applications of Autoencoders
Book page 542

Autoencoders have been successfully applied to


dimensionality reduction and information retrieval tasks. It
was one of the early motivations for studying autoencoders.
Lower-dimensional representations can improve
performance on many tasks, such as classification. Models
of smaller spaces consume less memory and runtime.

One task that benefits even more than usual from


dimensionality reduction is information retrieval, the task of
finding entries in a database that resemble a query entry.
This task derives the usual benefits from dimensionality
reduction that other tasks do, but also derives the additional
benefit that search can become extremely efficient in certain
kinds of low dimensional spaces. Specifically, if we train the
dimensionality reduction algorithm to produce a code that is
low-dimensional and binary, then we can store all database
entries in a hash table mapping binary code vectors to
entries. This hash table allows us to perform information
retrieval by returning all database entries that have the same
binary code as the query. We can also search over slightly
less similar entries very efficiently, just by flipping individual
bits from the encoding of the query. This approach to
information retrieval via dimensionality reduction and
binarization is called semantic hashing, and has been
applied to both textual input and images.

To produce binary codes for semantic hashing, one typically


uses an encoding function with sigmoids on the final layer.
The sigmoid units must be trained to be saturated to nearly
0 or nearly 1 for all input values. One trick that can
accomplish this is simply to inject additive noise just before
the sigmoid nonlinearity during training. The magnitude of
the noise should increase over time. To fight that noise and
preserve as much information as possible, the network must
increase the magnitude of the inputs to the sigmoid function,
until saturation occurs. The idea of learning a hashing
function has been further explored in several directions,
including the idea of training the representations so as to
optimize a loss more directly linked to the task of finding
nearby examples in the hash table.
Module No. 4 Recurrent and Recursive Nets & Normalizations
• Recurrent Neural Networks
Book page 389
https://www.geeksforgeeks.org/introduction-to-recurrent-neural-network/

Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other; but in cases where it is required to predict the next word of a sentence, the previous words are needed, and hence there is a need to remember them. Thus the RNN came into existence, which solved this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.
RNNs have a “memory” which remembers all information about what has been calculated. An RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This reduces the complexity of parameters, unlike other neural networks.

How RNN works


Suppose there is a deeper network with one input layer, three hidden layers and one output layer. Then, like other neural networks, each hidden layer will have its own set of weights and biases: say the weights and biases are (w1, b1) for hidden layer 1, (w2, b2) for the second hidden layer and (w3, b3) for the third hidden layer. This means that each of these layers is independent of the others, i.e., they do not memorize the previous outputs.
https://media.geeksforgeeks.org/wp-content/uploads/d2-1.jpg
Now the RNN will do the following:
• The RNN converts the independent activations into
dependent activations by providing the same weights
and biases to all the layers, thus reducing the
complexity of increasing parameters and memorizing
each previous output by giving it as input to the next
hidden layer.
• Hence these three layers can be joined together, such
that the weights and biases of all the hidden layers are
the same, into a single recurrent layer.

• Formula for calculating the current state:

ht = f(ht-1, xt)

where:
ht -> current state
ht-1 -> previous state
xt -> input state

• Formula for applying the activation function (tanh):

ht = tanh(whh · ht-1 + wxh · xt)

where:
whh -> weight at recurrent neuron
wxh -> weight at input neuron

• Formula for calculating output:

yt = why · ht

where:
yt -> output
why -> weight at output layer
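Putting the three formulas together, a minimal NumPy sketch of one recurrent step (the weight matrices and biases are assumed to be already initialized with compatible shapes):

import numpy as np

def rnn_step(x_t, h_prev, w_hh, w_xh, w_hy, b_h, b_y):
    # ht = tanh(whh . h(t-1) + wxh . xt); yt = why . ht
    h_t = np.tanh(w_hh @ h_prev + w_xh @ x_t + b_h)
    y_t = w_hy @ h_t + b_y
    return h_t, y_t

# The same weights are reused at every time step of the sequence.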

Training through RNN


1. A single time step of the input is provided to the network.
2. Then calculate its current state using the current input and the previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go through as many time steps as the problem requires and join the information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate the output.
6. The output is then compared to the actual output, i.e., the target output, and the error is generated.
7. The error is then back-propagated through the network to update the weights, and hence the network (RNN) is trained.

Advantages of Recurrent Neural Network


1. An RNN remembers information through time. It is useful in time-series prediction because of its ability to remember previous inputs as well; this kind of memory is extended in Long Short-Term Memory (LSTM) networks.
2. Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or
relu as an activation function.

• Unfolded RNNs
Book 391
A computational graph is a way to formalize the structure of
a set of computations, such as those involved in mapping
inputs and parameters to outputs and loss. The key idea is
unfolding a recursive or recurrent computation into a
computational graph that has a repetitive structure, typically
corresponding to a chain of events. Unfolding this graph
results in the sharing of parameters across a deep network
structure.
For example, consider the classical form of a dynamical
system:

s(t) = f(s(t-1); θ)

where s(t) is called the state of the system.
This equation is recurrent because the definition of s at time
t refers back to the same definition at time t - 1.
For a finite number of time steps t, the graph can be
unfolded by applying the definition t - 1 times. For example,
if we unfold the equation for t = 3 time steps, we obtain

s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

Unfolding the equation by repeatedly applying the definition
in this way has yielded an expression that does not involve
recurrence. Such an expression can now be represented by
a traditional directed acyclic computational graph.

Each node represents the state at some time t, and the function f maps the state at t to the state at t + 1. The same parameters are used for all time steps.

• Bidirectional RNNs
Book 411
https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks

Bidirectional recurrent neural networks (BRNN) connect two


hidden layers of opposite directions to the same output. With
this form of generative deep learning, the output layer can
get information from past (backwards) and future (forward)
states simultaneously.

Architecture
The principle of BRNN is to split the neurons of a regular RNN into two directions, one for the positive time direction (forward states), and another for the negative time direction (backward states). The outputs of those two states are not connected to the inputs of the opposite-direction states. The general structure of the RNN and the BRNN can be depicted in a diagram (omitted here). By using two time directions, input information from the past and future of the current time frame can be used, unlike a standard RNN, which requires delays for including future information.

Training
BRNNs can be trained using similar algorithms to RNNs,
because the two directional neurons do not have any
interactions. However, when back-propagation through time
is applied, additional processes are needed because
updating input and output layers cannot be done at once.
General procedures for training are as follows: For forward
pass, forward states and backward states are passed first,
then output neurons are passed. For backward pass, output
neurons are passed first, then forward states and backward
states are passed next. After forward and backward passes
are done, the weights are updated.

• Encoder-Decoder Sequence-to-Sequence Architectures


Book 412
https://medium.com/analytics-vidhya/encoder-decoder-seq2seq-models-clearly-explained-c34186fbf49b
Sequence-to-Sequence (Seq2Seq) problems are a special class of sequence modelling problems in which both the input and the output are sequences. Encoder-Decoder models were originally built to solve such Seq2Seq problems.
A typical sequence-to-sequence model has two parts – an encoder and a decoder. Both parts are practically two different neural network models combined into one giant network.

Broadly, the task of an encoder network is to understand the


input sequence, and create a smaller dimensional
representation of it. This representation is then forwarded to a
decoder network which generates a sequence of its own that
represents the output.
From the book:

We often call the input to the RNN the "context." We want to


produce a representation of this context, C. The context C
might be a vector or sequence of vectors that summarize the
input sequence X = (x(1), . . ., x(n)).

The idea of encoder-decoder or sequence-to-sequence is very


simple:

1. an encoder or reader or input RNN processes the input


sequence. The encoder emits the context C, usually as a
simple function of its final hidden state.
2. a decoder or writer or output RNN is conditioned on that
fixed-length vector to generate the output sequence Y =
(y(1), . . ., y(ny)).

The innovation of this kind of architecture over those presented


in earlier sections of this chapter is that the lengths nx and ny
can vary from each other, while previous architectures
constrained nx = ny =T. In a sequence-to-sequence
architecture, the two RNNs are trained jointly to maximize the
average of log

P(y(1), . . ., y(ny) | x(1), . . ., x(nx))

over all the pairs of x and y sequences in the training set. The last state hnx of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN. If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN.
there are at least two ways for a vector-to-sequence RNN to
receive input. The input can be provided as the initial state of
the RNN, or the input can be connected to the hidden units at
each time step. These two ways can also be combined.
There is no constraint that the encoder must have the same size of hidden layer as the decoder.
One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.
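A hedged PyTorch sketch of the encoder-decoder idea: the encoder's final hidden state is used as the context C that conditions the decoder. The vocabulary sizes, hidden size and use of GRUs are illustrative assumptions.

import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, in_vocab=1000, out_vocab=1000, hidden=256):
        super().__init__()
        self.enc_emb = nn.Embedding(in_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.dec_emb = nn.Embedding(out_vocab, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_vocab)

    def forward(self, src, tgt):
        # Encoder reads the input sequence; its last hidden state is C.
        _, C = self.encoder(self.enc_emb(src))
        # Decoder is conditioned on C and generates its own sequence;
        # src and tgt may have different lengths (nx != ny).
        dec_out, _ = self.decoder(self.dec_emb(tgt), C)
        return self.out(dec_out)  # scores over the output vocabulary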

• Deep Recurrent Networks


Book 414
The computation in most RNNs can be decomposed into three
blocks of parameters
and associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
With the RNN architecture shown, each of these three blocks is associated with a single weight matrix, i.e.:
• When the network is unfolded, each of these corresponds to a shallow transformation.
• By a shallow transformation we mean a transformation that would be represented by a single layer within a deep MLP.
• Typically this is a transformation represented by a learned affine transformation followed by a fixed nonlinearity.

Would it be advantageous to introduce depth into each of these


operations?
Experimental evidence strongly suggests so: we need enough depth in order to perform the required transformations.

Ways of making an RNN deep


(Diagrams: each figure is shown with its explanation below.)
1. Recurrent states broken down into groups
We can think of the lower levels of the hierarchy as playing the role of transforming the raw input into a representation that is more appropriate for the higher levels of the hidden state.
2. Deeper computation in hidden-to-hidden
Go a step further and propose to have a separate
MLP (possibly deep) for each of the three blocks:
1. From the input to the hidden state
2. From the previous hidden state to the next hidden
state
3. From the hidden state to the output
Considerations of representational capacity suggest allocating enough capacity in each of these three steps.
• But doing so by adding depth may hurt learning by making optimization difficult.
• In general it is easier to optimize shallower architectures.
• Adding the extra depth makes the shortest path from a variable in time step t to a variable in time step t+1 become longer.

3. Introducing skip connections


• For example, if an MLP with a single hidden layer is used for the state-to-state transition, we have doubled the length of the shortest path between variables in any two different time steps compared with the ordinary RNN.
• This can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated here.

▪ Recursive Neural Networks


Recursive neural networks represent yet another generalization
of recurrent networks, with a different kind of
computational graph, which is structured as a deep
tree, rather than the chain-like structure of RNNs.

Recursive networks have been successfully applied to processing data structures as input to neural nets, in natural language processing as well as in computer vision.

One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length t, the depth can be drastically reduced from t to O(log t), which might help deal with long-term dependencies.
One option is to have a tree structure which does not depend on the data, such as a balanced binary tree. In some application domains, external methods can suggest the appropriate tree structure. Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input.
Many variants of the recursive net idea are possible. The
computation performed by each node does not have to be the
traditional artificial neuron computation.

▪ Echo State Networks


http://www.scholarpedia.org/article/Echo_state_network
Book 421
Echo state networks (ESNs) provide an architecture and supervised learning principle for recurrent neural networks (RNNs). The main idea is:
1. to drive a random, large, fixed recurrent neural network with the input signal, thereby inducing in each neuron within this "reservoir" network a nonlinear response signal, and
2. to combine a desired output signal by a trainable linear combination of all of these response signals.

The basic idea of ESNs is shared with Liquid State Machines (LSMs). Increasingly often, LSMs, ESNs and the more recently explored Backpropagation Decorrelation learning rule for RNNs are subsumed under the name of Reservoir Computing.

The basic idea also informed a model of temporal input


discrimination in biological neural networks.

The ESN approach follows these steps:

Step 1: Provide a random RNN.
(i) Create a random dynamical reservoir RNN, using any neuron model.
(ii) Attach input units to the reservoir by creating random all-to-all connections.
(iii) Create output units. If the task requires output feedback, install randomly generated output-to-reservoir connections (all-to-all). If the task does not require output feedback, do not create any connections to/from the output units in this step.

Step 2: Harvest reservoir states.


Drive the dynamical reservoir with the training data D for times
n=1,…,nmax . In tasks without output feedback, the reservoir is
driven by the input u(n) only. This results in a sequence x(n) of N-dimensional reservoir states. Each component signal xi(n) is a nonlinear transform of the driving input.

Step 3: Compute output weights.


Compute the output weights as the linear regression weights of
the teacher outputs y(n) on the reservoir states x(n). Use these
weights to create reservoir-to-output connections. The training is
now completed and the ESN ready for use.
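A compact NumPy sketch of these three steps for a task without output feedback; the reservoir size, spectral-radius scaling and ridge parameter are illustrative assumptions.

import numpy as np

def train_esn(u, y, N=200, rho=0.9, ridge=1e-6, seed=0):
    # u: (T, d_in) inputs; y: (T, d_out) teacher outputs.
    rng = np.random.default_rng(seed)
    # Step 1: random fixed reservoir and random input connections.
    W = rng.standard_normal((N, N))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
    W_in = rng.standard_normal((N, u.shape[1]))
    # Step 2: harvest reservoir states x(n) driven by the input only.
    X = np.zeros((len(u), N))
    x = np.zeros(N)
    for t in range(len(u)):
        x = np.tanh(W @ x + W_in @ u[t])
        X[t] = x
    # Step 3: output weights via (ridge) linear regression of y on the states.
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N), X.T @ y)
    return W, W_in, W_out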

▪ The Long Short-Term Memory


https://en.wikipedia.org/wiki/Long_short-term_memory
Book 426
Long short-term memory (LSTM) is an artificial recurrent neural
network (RNN) architecture used in the field of deep learning. Unlike
standard feedforward neural networks, LSTM has feedback
connections. It can not only process single data points (such as
images), but also entire sequences of data (such as speech or video).
For example, LSTM is applicable to tasks such as unsegmented,
connected handwriting recognition, speech recognition and anomaly
detection in network traffic or IDSs (intrusion detection systems).

A common LSTM unit is composed of a cell, an input gate, an output


gate and a forget gate. The cell remembers values over arbitrary time
intervals and the three gates regulate the flow of information into
and out of the cell.
LSTM networks are well-suited to classifying, processing and making
predictions based on time series data, since there can be lags of
unknown duration between important events in a time series.
We can see the block diagram of the LSTM recurrent network "cell," where the cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the input unit can have any squashing nonlinearity. The state unit can also be used as an extra input to the gating units. The black square indicates a delay of 1 time unit.

LSTM recurrent networks have "LSTM cells" that have an internal


recurrence (a self-loop), in addition to the outer recurrence of the
RNN. Each cell has the same inputs and outputs as an ordinary
recurrent network, but has more parameters and a system of gating
units that controls the flow of information.
The most important component is the state unit si(t), which has a linear self-loop similar to the leaky units described in the previous section. However, here the self-loop weight (or the associated time constant) is controlled by a forget gate unit fi(t) (for time step t and cell i), which sets this weight to a value between 0 and 1 via a sigmoid unit:

fi(t) = σ( bif + Σj Ui,jf xj(t) + Σj Wi,jf hj(t-1) )

where x(t) is the current input vector and h(t) is the current hidden layer vector, containing the outputs of all the LSTM cells, and bf, Uf, Wf are respectively biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight fi(t):

si(t) = fi(t) si(t-1) + gi(t) σ( bi + Σj Ui,j xj(t) + Σj Wi,j hj(t-1) )

where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit gi(t) is computed similarly to the forget gate, but with its own parameters:

gi(t) = σ( big + Σj Ui,jg xj(t) + Σj Wi,jg hj(t-1) )

The output hi(t) of the LSTM cell can also be shut off, via the output gate qi(t), which also uses a sigmoid unit for gating:

hi(t) = tanh( si(t) ) qi(t)
qi(t) = σ( bio + Σj Ui,jo xj(t) + Σj Wi,jo hj(t-1) )
▪ Gated Recurrent Units


Book 429
https://en.wikipedia.org/wiki/Gated_recurrent_unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks. The GRU is like a long short-term memory (LSTM) with a forget gate, but has fewer parameters than the LSTM, as it lacks an output gate. GRU performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of the LSTM. GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets.

The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. In the standard formulation, the update equations are the following:

u(t) = σ( bu + Uu x(t) + Wu h(t-1) )   (update gate)
r(t) = σ( br + Ur x(t) + Wr h(t-1) )   (reset gate)
h(t) = u(t) ⊙ h(t-1) + (1 − u(t)) ⊙ tanh( b + U x(t) + W (r(t) ⊙ h(t-1)) )
The reset and update gates can individually "ignore" parts of the
state vector. The update gates act like conditional leaky
integrators that can linearly gate any dimension, thus choosing to
copy it or completely ignore it by replacing it by the new "target
state" value (towards which the leaky integrator wants to
converge). The reset gates control which parts of the state get
used to compute the next target state, introducing an additional
nonlinear effect in the relationship between past state and future
state.
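Matching those equations, a NumPy sketch of a single GRU step; the parameter dictionary p is an assumed container of biases and weight matrices with compatible shapes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    u = sigmoid(p['bu'] + p['Uu'] @ x + p['Wu'] @ h_prev)  # update gate
    r = sigmoid(p['br'] + p['Ur'] @ x + p['Wr'] @ h_prev)  # reset gate
    h_cand = np.tanh(p['b'] + p['U'] @ x + p['W'] @ (r * h_prev))
    # A single update gate interpolates between the old state and the candidate.
    return u * h_prev + (1 - u) * h_cand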

▪ Applications of Recurrent Neural Networks


1. Machine Translation
2. Robot control
3. Time series prediction
4. Speech recognition
5. Speech synthesis
6. Time series anomaly detection
7. Handwriting recognition
▪ Batch Normalization
https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
Batch Normalization is a process to make neural networks faster and more stable through adding extra layers in a deep neural network. The new layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer.
A typical neural network is trained using a collected set of input data called a batch. Similarly, the normalizing process in batch normalization takes place in batches, not on a single input.
It is a two-step process: first the input is normalized, and later rescaling and offsetting are performed, as the sketch below illustrates.
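A sketch of the two steps for fully connected activations (NumPy; gamma and beta are the learnable rescale/offset parameters, and eps is a small constant assumed for numerical stability):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Step 1: normalize over the batch axis.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 2: rescale and offset with learned parameters.
    return gamma * x_hat + beta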

▪ Instance Normalization
https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
Layer normalization and instance normalization are very similar to each other, but the difference between them is that instance normalization normalizes across each channel in each training example, instead of normalizing across the input features of a training example. Unlike batch normalization, the instance normalization layer is applied at test time as well (since it does not depend on the mini-batch).
This technique was originally devised for style transfer; the problem instance normalization tries to address is that the network should be agnostic to the contrast of the original image.

▪ Group Normalization
https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
Group Normalization normalizes over groups of channels for each training example. We can say that Group Norm is in between Instance Norm and Layer Norm: when we put all the channels into a single group, group normalization becomes layer normalization; and when we put each channel into a different group, it becomes instance normalization.

Sᵢ is defined as:

Sᵢ = { k | k_N = i_N, ⌊k_C / (C/G)⌋ = ⌊i_C / (C/G)⌋ }

Here, x is the feature computed by a layer, and i is an index. In the case of 2D images, i = (i_N, i_C, i_H, i_W) is a 4D vector indexing the features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the spatial height and width axes. G is the number of groups, which is a pre-defined hyperparameter. C/G is the number of channels per group. ⌊.⌋ is the floor operation, and “⌊k_C/(C/G)⌋ = ⌊i_C/(C/G)⌋” means that the indexes i and k are in the same group of channels, assuming each group of channels is stored in sequential order along the C axis. GN computes µ and σ along the (H, W) axes and along a group of C/G channels.
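A NumPy sketch of group normalization over (N, C, H, W) features; gamma and beta are assumed to be broadcastable learnable parameters of shape (1, C, 1, 1).

import numpy as np

def group_norm(x, G, gamma, beta, eps=1e-5):
    # Normalize each sample over each group of C/G channels and (H, W).
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mu) / np.sqrt(var + eps)
    return xg.reshape(N, C, H, W) * gamma + beta

# G = 1 recovers layer normalization; G = C recovers instance normalization.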

▪ Difference between Batch Normalization and Instance Normalization
https://becominghuman.ai/all-about-normalization-6ea79e70894b
In “Batch Normalization”, mean and variance are
calculated for each individual channel across all samples and both
spatial dimensions.
In “Instance Normalization”, mean and variance are
calculated for each individual channel for each individual
sample across both spatial dimensions.

▪ Module No. 5:
Deep Generative Models & Deep Learning architectures

• Boltzmann Machines
• Book 671
• https://en.wikipedia.org/wiki/Boltzmann_machine

Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors.
We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}^d. The Boltzmann machine is an energy-based model, meaning we define the joint probability distribution using an energy function:

P(x) = exp(−E(x)) / Z

where E(x) is the energy function and Z is the partition function that ensures that Σx P(x) = 1. The energy function of the Boltzmann machine is given by

E(x) = −x⊤Ux − b⊤x

where U is the "weight" matrix of model parameters and b is the


vector of bias parameters.

In the general setting of the Boltzmann machine, we are given a set of training examples, each of which is n-dimensional. The equation above describes the joint probability distribution over the observed variables. While this scenario is certainly viable, it does limit the kinds of interactions between the observed variables to those described by the weight matrix. Specifically, it means that the probability of one unit being on is given by a linear model (logistic regression) from the values of the other units.
The Boltzmann machine becomes more powerful when not all the
variables are observed. In this case, the non-observed variables,
or latent variables, can act similarly to hidden units in a multi-layer
perceptron and model higher-order interactions among the
visible units. Just as the addition of hidden units to convert logistic
regression into an MLP results in the MLP being a universal
approximator of functions, a Boltzmann machine with hidden
units is no longer limited to modeling linear relationships between
variables. Instead, the Boltzmann machine becomes a universal
approximator of probability mass functions over discrete
variables
Formally, we decompose the units x into two subsets: the visible units v and the latent (or hidden) units h. The energy function becomes

E(v, h) = −v⊤Rv − v⊤Wh − h⊤Sh − b⊤v − c⊤h
Boltzmann machine learning: Learning algorithms for Boltzmann machines are usually based on maximum likelihood. All Boltzmann machines have an intractable partition function, so the maximum likelihood gradient must be approximated.

• Restricted Boltzmann Machines


• Book 673
• https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine

A restricted Boltzmann machine (RBM) is a generative stochastic


artificial neural network that can learn a probability distribution
over its set of inputs. RBMs have found applications in
dimensionality reduction, classification, collaborative filtering,
feature learning, topic modelling and even many body quantum
mechanics. They can be trained in either supervised or
unsupervised ways, depending on the task. RBMs are a variant of
Boltzmann machines, with the restriction that their neurons must
form a bipartite graph: a pair of nodes from each of the two
groups of units (referred to as the "visible" and "hidden" units
respectively) may have a symmetric connection between them;
and there are no connections between nodes within a group.
RBMs are undirected probabilistic graphical models containing a
layer of observable variables and a single layer of latent variables.
RBMs may be stacked (one on top of the other) to form deeper
models.
The restricted Boltzmann machine is an energy-based model with
the joint probability distribution specified by its energy function:
P(v, h) = (1/Z) exp(−E(v, h)), with E(v, h) = −bᵀv − cᵀh − vᵀWh,
where Z = Σ_v Σ_h exp(−E(v, h)) is the partition function. It is
apparent from the definition of the partition function Z that the
naive method of computing Z (exhaustively summing over all states)
could be computationally intractable, unless a cleverly designed
algorithm could exploit regularities in the probability distribution
to compute Z faster. In the case of restricted Boltzmann machines,
the partition function has in fact been formally shown to be
intractable. The intractable partition function Z implies that the
normalized joint probability distribution P(v) is also intractable
to evaluate.
Advantages and Disadvantages of RBMs
Advantages:
1. Expressive enough to encode any distribution, while remaining
computationally efficient.
2. Faster than the traditional Boltzmann machine, due to the
restrictions on connections between nodes.
3. Activations of the hidden layer can be used as input to other
models, as useful features to improve performance.
Disadvantages:
1. Training is more difficult, because it is difficult to calculate
the energy gradient.
2. The contrastive divergence (CD-k) algorithm used to train RBMs is
not as well understood as the backpropagation algorithm (a minimal
sketch of a CD-1 update is given below).
3. Weight adjustment.
Applications include dimensionality reduction, recommender systems,
and topic modelling.
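
A minimal numpy sketch of one CD-1 update for a binary RBM, assuming a single training vector v0, a weight matrix W of shape (n_visible, n_hidden), and bias vectors b (visible) and c (hidden):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd1_update(v0, W, b, c, lr=0.01):
        # Positive phase: sample hidden units given the data.
        ph0 = sigmoid(v0 @ W + c)
        h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step gives a "reconstruction".
        pv1 = sigmoid(h0 @ W.T + b)
        v1 = (np.random.rand(*pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        # Approximate gradient: data statistics minus model statistics.
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)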

• Stacking Restricted Boltzmann Machines: Deep Belief Networks
• Book 677
https://en.wikipedia.org/wiki/Deep_belief_network
A deep belief network (DBN) is a generative graphical model, or
alternatively a class of deep neural network, composed of multiple
layers of latent variables ("hidden units"), with connections between
the layers but not between units within each layer.
When trained on a set of examples without supervision, a DBN can
learn to probabilistically reconstruct its inputs. The layers then act as
feature detectors. After this learning step, a DBN can be further
trained with supervision to perform classification.
DBNs can be viewed as a composition of simple, unsupervised
networks such as restricted Boltzmann machines (RBMs) or
autoencoders, where each sub-network's hidden layer serves as the
visible layer for the next. An RBM is an undirected, generative
energy-based model with a "visible" input layer and a hidden layer
and connections between but not within layers. This composition
leads to a fast, layer-by-layer unsupervised training procedure,
where contrastive divergence is applied to each sub-network in turn,
starting from the "lowest" pair of layers (the lowest visible layer is a
training set).

Deep belief networks are generative models with several layers of
latent variables. The latent variables are typically binary, while the
visible units may be binary or real. There are no intra-layer
connections. Usually, every unit in each layer is connected to every
unit in each neighboring layer, though it is possible to construct more
sparsely connected DBNs. The connections between the top two
layers are undirected. The connections between all other layers are
directed, with the arrows pointed toward the layer that is closest to
the data.
A DBN with L hidden layers contains L weight matrices: W(1), ..., W(L).
It also contains L + 1 bias vectors: b(0), ..., b(L), with b(0) providing
the biases for the visible layer. The probability distribution
represented by the DBN is given for the top two layers by
P(h(L), h(L−1)) ∝ exp(b(L)ᵀh(L) + b(L−1)ᵀh(L−1) + h(L−1)ᵀW(L)h(L)),
with directed conditionals P(h(k)_i = 1 | h(k+1)) = σ(b(k)_i +
W(k+1)_{:,i}ᵀ h(k+1)) for the layers below; for real-valued visible
units, v ∼ N(v; b(0) + W(1)ᵀh(1), β⁻¹), with β diagonal for
tractability.
To generate a sample from a DBN, we first run several steps of Gibbs
sampling on the top two hidden layers. This stage is essentially
drawing a sample from the RBM defined by the top two hidden
layers. We can then use a single pass of ancestral sampling through
the rest of the model to draw a sample from the visible units.

https://medium.com/@icecreamlabs/deep-belief-networks-all-you-
need-to-know-68aa9a71cc53

Training a Deep Belief Network
The first step is to train a layer of features that obtain their
input signals directly from the pixels. The next step is to treat the
values of this layer as pixels and learn features of these features
in a second hidden layer. Every time another layer of features is
added to the belief network, the lower bound on the log probability
of the training data set improves.
Example:
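A minimal numpy sketch of this greedy, layer-by-layer procedure, reusing the hypothetical cd1_update and sigmoid helpers from the RBM sketch above:

    import numpy as np

    def train_dbn(data, layer_sizes, epochs=10, lr=0.01):
        rbms = []
        v = data  # (n_samples, n_visible), binary or in [0, 1]
        for n_hidden in layer_sizes:
            W = 0.01 * np.random.randn(v.shape[1], n_hidden)
            b = np.zeros(v.shape[1])
            c = np.zeros(n_hidden)
            for _ in range(epochs):
                for x in v:
                    cd1_update(x, W, b, c, lr)  # CD-1 step on this layer
            rbms.append((W, b, c))
            # Hidden activations become the "pixels" for the next layer.
            v = sigmoid(v @ W + c)
        return rbms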

• Deep Boltzmann Machines
• Book 680
• https://en.wikipedia.org/wiki/Boltzmann_machine

A deep Boltzmann machine or DBM is another kind of deep,
generative model. Unlike the deep belief network (DBN), it is
an entirely undirected model. Unlike the RBM, the DBM has
several layers of latent variables (RBMs have just one). But like
the RBM, within each layer the variables are mutually
independent, conditioned on the variables in the neighboring
variety of tasks including document modeling.
A DBM is an energy-based model, meaning that the joint
probability distribution over the model variables is parametrized
by an energy function E. In the case of a deep Boltzmann machine
with one visible layer, v, and three hidden layers, h(1), h(2) and
h(3), the joint probability is given by:
P(v, h(1), h(2), h(3)) = (1/Z(θ)) exp(−E(v, h(1), h(2), h(3); θ)),
where (omitting bias parameters for simplicity) the DBM energy is
E(v, h(1), h(2), h(3); θ) = −vᵀW(1)h(1) − h(1)ᵀW(2)h(2) − h(2)ᵀW(3)h(3).
In comparison to the RBM energy function, the DBM energy
function includes connections between the hidden units in the
form of the weight matrices W(2) and W(3). These connections
have significant consequences for both the model behavior and
how we go about performing inference in the model. In
comparison to fully connected Boltzmann machines, the DBM
offers some advantages that are similar to those offered by the
RBM: the DBM layers can be organized into a bipartite graph,
with odd layers on one side and even layers on the other.
The bipartite structure of the DBM means that we can apply the
same kind of equations used for the RBM conditionals to determine
the conditional distributions in a DBM. The units within a layer
are conditionally independent from each other given the values of
the neighboring layers, so the distributions over binary variables
can be fully described by the Bernoulli parameters giving the
probability of each unit being active. For the three-hidden-layer
DBM above (biases omitted), the activation probabilities are given by:
P(v_i = 1 | h(1)) = σ(W(1)_{i,:} h(1)),
P(h(1)_j = 1 | v, h(2)) = σ(vᵀW(1)_{:,j} + W(2)_{j,:} h(2)),
P(h(2)_k = 1 | h(1), h(3)) = σ(h(1)ᵀW(2)_{:,k} + W(3)_{k,:} h(3)), and
P(h(3)_l = 1 | h(2)) = σ(h(2)ᵀW(3)_{:,l}).

The bipartite structure makes Gibbs sampling in a deep Boltzmann
machine efficient. The naive approach to Gibbs sampling is to update
only one variable at a time. RBMs allow all of the visible units to be
updated in one block and all of the hidden units to be updated in a
second block. One might naively assume that a DBM with L layers
requires L + 1 updates, with each iteration updating a block consisting
of one layer of units. Instead, it is possible to update all of the units
in only two iterations. Gibbs sampling can be divided into two blocks
of updates, one including all even layers and the other including all
odd layers. Due to the bipartite DBM connection pattern, given the
even layers, the distribution over the odd layers is factorial and thus
can be sampled simultaneously and independently as a block.
Likewise, given the odd layers, the even layers can be sampled
simultaneously and independently as a block. Efficient sampling is
especially important for training with the stochastic maximum
likelihood algorithm.
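
A minimal numpy sketch of one such two-block update for the three-hidden-layer DBM above (biases omitted to match the energy function, reusing the sigmoid helper from the RBM sketch):

    import numpy as np

    def sample(p):
        # Draw a binary sample from Bernoulli probabilities p.
        return (np.random.rand(*p.shape) < p).astype(float)

    def gibbs_step(v, h1, h2, h3, W1, W2, W3):
        # Block 1: odd layers (h1, h3) given the even layers (v, h2).
        h1 = sample(sigmoid(v @ W1 + h2 @ W2.T))
        h3 = sample(sigmoid(h2 @ W3))
        # Block 2: even layers (v, h2) given the odd layers (h1, h3).
        v = sample(sigmoid(h1 @ W1.T))
        h2 = sample(sigmoid(h1 @ W2 + h3 @ W3.T))
        return v, h1, h2, h3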

• Directed Generative Nets
• Book 709
Directed graphical models make up a prominent class of graphical
models. While directed graphical models have been very popular
within the greater machine learning community, within the
smaller deep learning community they were until roughly 2013
overshadowed by undirected models such as the RBM.
Deep belief networks are a partially directed model.
Sparse coding models are shallow directed generative models.
They are often used as feature learners in the context of deep
learning, though they tend to perform poorly at sample
generation and density estimation.
Some fully directed models:
1. Sigmoid Belief Networks
2. Variational Autoencoders
3. Generative Adversarial Networks

• Variational Autoencoder
• Book 713
https://deep-learning-study-
note.readthedocs.io/en/latest/Part%203%20(Deep%20Learning
%20Research)/20%20Deep%20Generative%20Models/20.10%20
Directed%20Generative%20Nets.html
(From the website:)
The variational autoencoder (VAE) is a directed model that uses
learned approximate inference and can be trained purely with
gradient-based algorithms. The process of generating samples from
the model is:
1. Draw a sample z from the distribution p_model(z).
2. Run the sample z through a differentiable generator network g(z).
3. Sample x from the distribution p_model(x; g(z)) = p_model(x | z).
During training, the approximate inference network (or encoder)
q(z | x) is used to obtain z, and p_model(x | z) is then viewed as a
decoder network.
Review of the variational lower bound: introducing q as an arbitrary
(approximate posterior) distribution over the latent variables gives
the bound
L = E_{z∼q(z|x)}[log p_model(x | z)] − D_KL(q(z | x) ‖ p_model(z)) ≤ log p_model(x).

The main idea behind the variational autoencoder is to train a
parametric encoder that produces the parameters of q. As long as
z is a continuous variable, we can then back-propagate through
samples of z drawn from q(z | x) = q(z; f(x; θ)) to obtain a gradient
with respect to θ. Learning then consists solely of maximizing L with
respect to the parameters of the encoder and decoder. All the
expectations in L may be approximated by Monte Carlo sampling.
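
A minimal numpy sketch of the reparameterization idea that makes this back-propagation through samples of z possible, assuming a Gaussian q(z | x) whose parameters (mu, log_var) are produced by the encoder and a standard normal prior (for which the KL term of L has the closed form below):

    import numpy as np

    def reparameterize(mu, log_var):
        # z = mu + sigma * eps, with noise eps independent of the parameters,
        # so gradients can flow back through mu and log_var.
        eps = np.random.randn(*mu.shape)
        return mu + np.exp(0.5 * log_var) * eps

    def kl_to_standard_normal(mu, log_var):
        # Closed-form D_KL(N(mu, sigma^2) || N(0, I)).
        return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)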

Drawbacks:
1. Samples from variational autoencoders trained on images tend
to be somewhat blurry.
2. They tend to use only a small subset of the dimensions of z, as
if the encoder were not able to transform enough of the local
distribution in input space to a space where the marginal
distribution matches the factorized prior.
The VAE framework is straightforward to extend to a wide range of
model architectures. This is a key advantage over Boltzmann
machines.
The deep recurrent attention writer (DRAW) uses a recurrent encoder
and recurrent decoder combined with an attention mechanism.
Sequences can be generated by defining a variational RNN, using a
recurrent encoder and decoder within the VAE framework. The
importance-weighted autoencoder is another extension.
Comparison:
VAE:
1. The variational autoencoder is defined for arbitrary
computational graphs, which makes it applicable to a wider range
of probabilistic model families, because there is no need to
restrict the choice of models to those with tractable mean-field
fixed-point equations.
2. It has the advantage of increasing a bound on the log-likelihood
of the model.
3. It learns an inference network for only one problem: inferring z
given x.
MP-DBM and other approaches that involve back-propagation through
the approximate inference:
1. These require an inference procedure, such as mean-field
fixed-point equations, to provide the computational graph.
2. They are more heuristic and have little probabilistic
interpretation beyond making the results of approximate inference
accurate.
3. They are able to perform approximate inference over any subset of
variables given any other subset of variables, because the
mean-field fixed-point equations specify how to share parameters
between the computational graphs for all the different problems.
A nice property of the VAE: simultaneously training a parametric
encoder in combination with the generator network forces the model
to learn a predictable coordinate system that the encoder can
capture.

• Generative Adversarial Networks (GANs)
• Book 717
• https://deep-learning-study-
note.readthedocs.io/en/latest/Part%203%20(Deep%20Learni
ng%20Research)/20%20Deep%20Generative%20Models/20.1
0%20Directed%20Generative%20Nets.html
• (From the website:)
Generative adversarial networks or GANs are another
generative modeling approach based on differentiable
generator networks. Generative adversarial networks are
based on a game theoretic scenario in which the generator
network must compete against an adversary.

• Generator: directly produces samples x = g(z; θ_g).
• Discriminator: attempts to distinguish between samples from the
training data and samples drawn from the generator. The
discriminator emits a probability value given by d(x; θ_d),
indicating the probability that x is a real training example rather
than a fake sample drawn from the model.


The main motivation for the design of the GAN is that its learning
process requires neither approximate inference nor approximation of
a partition function gradient.
Learning in GANs can be difficult in practice when g and d are
represented by neural networks and max_d v(g, d) is not convex. Note
that the equilibria for a minimax game are not local minima of v.
Instead, they are points that are simultaneously minima for both
players' costs. This means that they are saddle points of v that are
local minima with respect to the first player's parameters and local
maxima with respect to the second player's parameters. It is possible
for the two players to take turns increasing and then decreasing v
forever, rather than landing exactly on the saddle point, where
neither player is capable of reducing its cost.
Stabilization of GAN learning remains an open problem. Fortunately,
GAN learning performs well when the model architecture and
hyperparameters are carefully selected.
The GAN training procedure can fit probability distributions that
assign zero probability to the training points. Rather than maximizing
the log-probability of specific points, the generator net learns to
trace out a manifold whose points resemble training points in some way.
Units in the discriminator should be stochastically dropped while
computing the gradient for the generator network to follow.
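
A self-contained toy sketch of the minimax game, under simplifying assumptions chosen for brevity: 1-D real data drawn from N(4, 1), a generator g(z) = θ + z that shifts unit Gaussian noise, and a logistic discriminator d(x) = σ(wx + b), with the gradients worked out by hand:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    theta, w, b, lr = 0.0, 0.1, 0.0, 0.01
    for step in range(5000):
        x_real = 4.0 + np.random.randn()
        z = np.random.randn()
        x_fake = theta + z
        # Discriminator ascends v = log d(x_real) + log(1 - d(x_fake)).
        d_real, d_fake = sigmoid(w * x_real + b), sigmoid(w * x_fake + b)
        w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
        b += lr * ((1.0 - d_real) - d_fake)
        # Generator descends log(1 - d(g(z))) with respect to theta.
        d_fake = sigmoid(w * (theta + z) + b)
        theta += lr * d_fake * w
    # After training, theta should drift toward 4, the mean of the data.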

• Object Detection
https://en.wikipedia.org/wiki/Object_detection

Object detection is a computer technology related to computer
vision and image processing that deals with detecting instances of
semantic objects of a certain class (such as humans, buildings, or
cars) in digital images and videos. Well-researched domains of object
detection include face detection and pedestrian detection. Object
detection has applications in many areas of computer vision,
including image retrieval and video surveillance.
Every object class has its own special features that help in
classifying the class; for example, all circles are round. Object
class detection uses these special features. For example, when looking
for circles, objects that are at a particular distance from a point
(i.e. the center) are sought. Similarly, when looking for squares,
objects that are perpendicular at corners and have equal side lengths
are needed. A similar approach is used for face identification, where
eyes, nose, and lips can be located, and features like skin color and
the distance between the eyes can be used.
Methods
Methods for object detection generally fall into either neural
network-based or non-neural approaches. For non-neural approaches,
it becomes necessary to first define features, then use a technique
such as the support vector machine (SVM) to do the classification.
On the other hand, neural techniques are able to do end-to-end object
detection without specifically defining features, and are typically
based on convolutional neural networks (CNNs).

• ELM: Extreme Learning Machines
• https://en.wikipedia.org/wiki/Extreme_learning_machine
Extreme learning machines are feedforward neural networks for
classification, regression, clustering, sparse approximation,
compression and feature learning with a single layer or multiple
layers of hidden nodes, where the parameters of hidden nodes (not
just the weights connecting inputs to hidden nodes) need not be
tuned. These hidden nodes can be randomly assigned and never
updated (i.e. they are random projections with nonlinear
transforms), or can be inherited from their ancestors without being
changed.
In most cases, the output weights of hidden nodes are usually
learned in a single step, which essentially amounts to learning a
linear model.
These models are able to produce good generalization performance
and learn thousands of times faster than networks trained using
backpropagation. It has also been shown that these models can
outperform support vector machines in both classification and
regression applications.
Architectures:
In most cases, ELM is used as a single hidden layer feedforward
network (SLFN) including but not limited to sigmoid networks, RBF
networks, threshold networks, fuzzy inference networks, complex
neural networks, wavelet networks, Fourier transform, Laplacian
transform, etc. Due to its different learning algorithm
implementations for regression, classification, sparse coding,
compression, feature learning and clustering, multiple ELMs have been
used to form multi-hidden-layer networks, deep learning, or
hierarchical networks.
A hidden node in ELM is a computational element, which need not be
considered a classical neuron. A hidden node in ELM can be a
classical artificial neuron, a basis function, or a subnetwork formed
by some hidden nodes.

Algorithms:
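A minimal numpy sketch of the basic single-hidden-layer ELM algorithm (random, fixed hidden-layer weights; output weights solved in a single step by linear least squares):

    import numpy as np

    def elm_train(X, Y, n_hidden):
        # 1. Randomly assign input weights and biases (never updated).
        W = np.random.randn(X.shape[1], n_hidden)
        b = np.random.randn(n_hidden)
        # 2. Compute the hidden-layer output matrix H.
        H = np.tanh(X @ W + b)
        # 3. Solve for the output weights in one step (pseudoinverse).
        beta = np.linalg.pinv(H) @ Y
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta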
▪ Module No. 6:

• Large-Scale Deep Learning
• Book 460
A large population of neurons or features acting together
can exhibit intelligent behavior. One of the key factors
responsible for the improvement in neural networks' accuracy
and in the complexity of the tasks they can solve is the
dramatic increase in the size of the networks we use.
Because the size of neural networks is of paramount
importance, deep learning requires high-performance
hardware and software infrastructure.
1. Fast CPU Implementations:
Although CPUs often cannot manage the high computational
workload required by large modern neural networks, careful
implementation for specific CPU families can yield large
improvements. For example, the best CPUs available have been
shown to run neural network workloads faster when using
fixed-point arithmetic rather than floating-point arithmetic.
The important principle is that careful specialization of
numerical computation routines can yield a large payoff.
2. GPU Implementations:
(Note: if time is short, the third paragraph below is the most
important.)
Most modern neural network implementations are based
on graphics processing units. Graphics processing units
(GPUs) are specialized hardware components that were
originally developed for graphics applications.

Video game rendering requires performing many
operations in parallel quickly. Graphics cards must perform
matrix multiplication and division on many vertices in
parallel to convert these 3-D coordinates into 2-D on-screen
coordinates. The graphics card must then perform many
computations at each pixel in parallel to determine the
color of each pixel. In both cases, the computations are
fairly simple and do not involve much branching compared
to the computational workload that a CPU usually
encounters.

Neural network algorithms require the same performance
characteristics as the real-time graphics algorithms. Neural
networks usually involve large and numerous buffers of
parameters, activation values, and gradient values, each of
which must be completely updated during every step of
training. These buffers are large enough to fall outside the
cache of a traditional desktop computer so the memory
bandwidth of the system often becomes the rate limiting
factor. GPUs offer a compelling advantage over CPUs due to
their high memory bandwidth. Neural network training
algorithms typically do not involve much branching or
sophisticated control, so they are appropriate for GPU
hardware. Since neural networks can be divided into
multiple individual “neurons” that can be processed
independently from the other neurons in the same layer,
neural networks easily benefit from the parallelism of GPU
computing.
3. Large Scale Distributed Implementations:
In many cases, the computational resources available on a
single machine are insufficient. We therefore want to
distribute the workload of training and inference across
many machines.
Distributing inference is simple, because each input
example we want to process can be run by a separate
machine. This is known as data parallelism. It is also possible
to get model parallelism, where multiple machines work
together on a single datapoint, with each machine running
a different part of the model. This is feasible for both
inference and training. Data parallelism during training is
somewhat harder. We can increase the size of the
minibatch used for a single SGD step, but usually we get less
than linear returns in terms of optimization performance. It
would be better to allow multiple machines to compute
multiple gradient descent steps in parallel. Unfortunately, the
standard definition of gradient descent is a completely
sequential algorithm: the gradient at step t is a function of
the parameters produced by step t − 1.
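
For the simpler synchronous case, here is a minimal sketch of one data-parallel SGD step; grad_fn is a hypothetical user-supplied function that returns the gradient of the loss on one shard of the minibatch:

    import numpy as np

    def data_parallel_sgd_step(params, grad_fn, minibatch, n_workers, lr=0.1):
        # Each "worker" gets one shard of the minibatch.
        shards = np.array_split(minibatch, n_workers)
        # Ideally these run in parallel, one per machine.
        grads = [grad_fn(params, shard) for shard in shards]
        # Apply the averaged gradient once.
        return params - lr * np.mean(grads, axis=0)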
4. Model Compression:
In many applications, it is much more important that the
time and memory cost of running inference in a machine
learning model be low than that the time and memory cost
of training be low. It is possible to train a model once, then
deploy it to be used by billions of users. In many cases, the
end user is more resource-constrained than the developer.
A key strategy for reducing the cost of inference is model
compression.
The basic idea of model compression is to replace the
original, expensive model with a smaller model that
requires less memory and runtime to store and evaluate.
Model compression is applicable when the size of the
original model is driven primarily by a need to prevent
overfitting. In most cases, the model with the lowest
generalization error is an ensemble of several
independently trained models. Evaluating all n ensemble
members is expensive. Sometimes, even a single model
generalizes better if it is large.
These large models learn some function f(x), but do so using
many more parameters than are necessary for the task.
Their size is necessary only due to the limited number of
training examples. As soon as we have fit this function f(x),
we can generate a training set containing infinitely many
examples, simply by applying f to randomly sampled points
x. We then train the new, smaller, model to match f (x) on
these points.
It is best to sample the new x points from a distribution
resembling the actual test inputs that will be supplied to the
model later.
Alternatively, one can train the smaller model only on the
original training points, but train it to copy other features of
the model, such as its posterior distribution over the
incorrect classes.
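A minimal sketch of this generate-and-refit idea; f_large, sample_inputs, and fit_small_model are hypothetical stand-ins for the trained (ensemble) model, a sampler over inputs resembling the test distribution, and any supervised learner:

    # Model compression by function matching (conceptual sketch).
    def compress(f_large, sample_inputs, fit_small_model, n_points=100000):
        X = sample_inputs(n_points)   # x drawn to resemble the test inputs
        Y = f_large(X)                # unlimited "training data" from f(x)
        return fit_small_model(X, Y)  # train the cheaper model to match f(x)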
5. Dynamic Structure
6. Specialized Hardware Implementations of Deep Networks

• Computer Vision:
• Book 469
The most popular standard tasks for deep learning
algorithms are forms of object recognition or optical
character recognition.
Applications of computer vision range from reproducing
human visual abilities, such as recognizing faces, to creating
entirely new categories of visual abilities, such as recognizing
sound waves from the vibrations they induce in objects visible
in a video. Most deep learning for computer vision is
used for object recognition or detection of some form,
whether this means reporting which object is present in an
image, annotating an image with bounding boxes around
each object, transcribing a sequence of symbols from an
image, or labeling each pixel in an image with the identity
of the object it belongs to.
Preprocessing
Many application areas require sophisticated preprocessing
because the original input comes in a form that is difficult
for many deep learning architectures to represent. The
images should be standardized so that their pixels all lie in
the same, reasonable range, like [0, 1] or [-1, 1]. Mixing
images that lie in [0, 1] with images that lie in [0, 255] will
usually result in failure. Many architectures also require
images of a standard size, so images must be cropped or
scaled to fit that size. However, even this rescaling is not
always strictly necessary. Some convolutional models
accept variably-sized inputs and dynamically adjust the size
of their pooling regions to keep the output size constant.
Dataset augmentation may be seen as a way of
preprocessing the training set only. Dataset augmentation
is an excellent way to reduce the generalization error of
most computer vision models.
Other kinds of preprocessing are applied to both the train
and the test set with the goal of putting each example into
a more canonical form in order to reduce the amount of
variation that the model needs to account for. Reducing the
amount of variation in the data can both reduce
generalization error and reduce the size of the model
needed to fit the training set.
Preprocessing for a small dataset is usually designed to
remove some kind of variability in the input data. Preprocessing
for large datasets is often unnecessary; it is best to just let
the model learn which kinds of variability it should become
invariant to.
o Dataset Augmentation:
It is easy to improve the generalization of a classifier
by increasing the size of the training set by adding
extra copies of the training examples that have been
modified with transformations that do not change the
class. Object recognition is a classification task that is
especially amenable to this form of dataset
augmentation because the class is invariant to so
many transformations and the input can be easily
transformed with many geometric operations.
Classifiers can benefit from random translations,
rotations, and in some cases, flips of the input to
augment the dataset. In specialized computer vision
applications, more advanced transformations are
commonly used for dataset augmentation. These
schemes include random perturbation of the colors in
an image and nonlinear geometric distortions of the
input.
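
A minimal numpy sketch of two such label-preserving transformations for an image array of shape (H, W, C); the shift size is an illustrative choice, and np.roll's wrap-around is a simplification of true translation:

    import numpy as np

    def augment(image, max_shift=4):
        # Random horizontal flip (class-preserving for most objects).
        if np.random.rand() < 0.5:
            image = image[:, ::-1, :]
        # Small random translation (wraps at the borders for simplicity).
        dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
        return np.roll(image, (dy, dx), axis=(0, 1))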
o Contrast Normalization:
Contrast simply refers to the magnitude of the
difference between the bright and the dark pixels in
an image. In the context of deep learning, contrast
usually refers to the standard deviation of the pixels
in an image or region of an image.
Suppose we have an image represented by a tensor
X ∈ ℝ^{r×c×3}, with X_{i,j,1} the red intensity at row i and
column j, X_{i,j,2} the green intensity, and X_{i,j,3} the
blue intensity. Global contrast normalization (GCN) aims to
prevent images from having varying amounts of contrast by
subtracting the mean from each image, then rescaling it so
that the standard deviation across its pixels is equal to
some constant s.
Images with very low but non-zero contrast often
have little information content. Dividing by the true
standard deviation usually accomplishes nothing
more than amplifying sensor noise or compression
artifacts in such cases.
This motivates introducing a small, positive regularization
parameter λ to bias the estimate of the standard deviation.
Alternately, one can constrain the denominator to be at least
ε. Formally, GCN maps an input image X to an output image X′
defined by
X′_{i,j,k} = s (X_{i,j,k} − X̄) / max(ε, √(λ + (1/(3rc)) Σ_{i,j,k} (X_{i,j,k} − X̄)²)),
where X̄ is the mean intensity of the entire image.
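
A minimal numpy sketch of GCN following the formula above (the default λ and ε here are illustrative choices; see the parameter discussion below):

    import numpy as np

    def global_contrast_normalize(X, s=1.0, lam=10.0, eps=1e-8):
        X = X - X.mean()                           # subtract the image mean
        contrast = np.sqrt(lam + np.mean(X ** 2))  # regularized std. dev.
        return s * X / max(contrast, eps)          # denominator >= eps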

Datasets consisting of large images cropped to interesting
objects are unlikely to contain any images with nearly
constant intensity. In these cases, it is safe to practically
ignore the small-denominator problem by setting λ = 0, and to
avoid division by 0 in extremely rare cases by setting ε to
an extremely low value like 10⁻⁸. Small images cropped
randomly are more likely to have nearly constant intensity,
making aggressive regularization more useful; ε = 0 and
λ = 10 have been used on small, randomly selected patches.
The scale parameter s can usually be set to 1, or chosen to
make each individual pixel have a standard deviation across
examples close to 1.
(More explanation:)
Global contrast normalization will often fail to
highlight image features we would like to stand out,
such as edges and corners. If we have a scene with a
large dark area and a large bright area then global
contrast normalization will ensure there is a large
difference between the brightness of the dark area
and the brightness of the light area. It will not,
however, ensure that edges within the dark region
stand out.
This motivates local contrast normalization. Local
contrast normalization ensures that the contrast is
normalized across each small window, rather than
over the image as a whole.
In all cases, one modifies each pixel by subtracting a
mean of nearby pixels and dividing by a standard
deviation of nearby pixels. In some cases, this is
literally the mean and standard deviation of all pixels
in a rectangular window centered on the pixel to be
modified. In other cases, this is a weighted mean and
weighted standard deviation using Gaussian weights
centered on the pixel to be modified. In the case of
color images, some strategies process different color
channels separately while others combine
information from different channels to normalize
each pixel.
Local contrast normalization can usually be
implemented efficiently by using separable
convolution to compute feature maps of local means
and local standard deviations, then using element-
wise subtraction and element-wise division on
different feature maps.
Local contrast normalization is a differentiable
operation and can also be used as a nonlinearity
applied to the hidden layers of a network, as well as a
preprocessing operation applied to the input.
As with global contrast normalization, we typically
need to regularize local contrast normalization to
avoid division by zero. In fact, because local contrast
normalization typically acts on smaller windows, it is
even more important to regularize. Smaller windows
are more likely to contain values that are all nearly the
same as each other, and thus more likely to have zero
standard deviation.
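
A minimal sketch of local contrast normalization for a grayscale image, using scipy's separable Gaussian filtering to compute the local means and local standard deviations:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def local_contrast_normalize(img, sigma=4.0, eps=1e-4):
        # Gaussian-weighted local mean (separable convolution internally).
        local_mean = gaussian_filter(img, sigma)
        centered = img - local_mean                  # element-wise subtraction
        # Gaussian-weighted local standard deviation.
        local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
        # Regularize to avoid division by zero in flat regions.
        return centered / np.maximum(local_std, eps)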
