• The Basic Architecture of Neural Networks: Single Computational Layer
https://www.upgrad.com/blog/neural-network-architecture-components-algorithms/
Here the nodes marked as “1” are known as bias units. The leftmost layer, or Layer 1, is the input layer; the middle layer, or Layer 2, is the hidden layer; and the rightmost layer, or Layer 3, is the output layer. We can say that the above diagram has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.
A multi-layered neural network of this kind is the typical example of a feed-forward neural network. The number of neurons and the number of layers constitute the hyperparameters of the network, which need tuning; in order to find good values for these hyperparameters, one must use cross-validation techniques. Weight-adjustment training is carried out using the back-propagation technique.
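A minimal sketch of the network described above (3 input units, 3 hidden units, 1 output unit, with bias units), assuming NumPy and a sigmoid activation; the weight values are random placeholders, not taken from the source:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the diagram: 3 inputs -> 3 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)   # hidden-layer weights and bias unit
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output-layer weights and bias unit

def forward(x):
    # Layer 2 (hidden): weighted sum of inputs plus bias, then activation.
    h = sigmoid(W1 @ x + b1)
    # Layer 3 (output): weighted sum of hidden activations plus bias.
    return sigmoid(W2 @ h + b2)

x = np.array([0.5, -1.2, 3.0])   # one example with 3 input units
print(forward(x))                # single output unit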
• Historical Trends in Deep Learning
• Book 26
It is easiest to understand deep learning with some historical
context. Rather than
providing a detailed history of deep learning, we identify a few
key trends:
• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.
• Deep learning has become more useful as the amount of available training data has increased.
• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
• Generalization
https://www.kdnuggets.com/2019/11/generalization-neural-networks.html
• Regularization
https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/
1. L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost function by adding another term known as the regularization term:
Cost function = Loss (say, binary cross-entropy) + Regularization term
Due to the addition of this regularization term, the values of the weight matrices decrease, because the method assumes that a neural network with smaller weight matrices leads to a simpler model. Therefore, it also reduces overfitting to quite an extent. However, the regularization term differs between L1 and L2.
In L2, the regularization term is the sum of squares of the weights, scaled by the regularization parameter λ and the number of examples m:
Cost function = Loss + (λ / 2m) * Σ ||w||²
In L1, the absolute values of the weights are penalized instead:
Cost function = Loss + (λ / 2m) * Σ ||w||
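A small sketch of how the two penalties modify the cost, assuming NumPy and a hypothetical list of weight matrices; λ and m follow the formulas above:

import numpy as np

def regularized_cost(loss, weights, lam=0.01, m=128, kind="l2"):
    # 'loss' is the data loss (e.g. binary cross-entropy) averaged over m examples.
    # 'weights' is a list of weight matrices; bias terms are normally not penalized.
    if kind == "l2":
        penalty = sum(np.sum(W ** 2) for W in weights)      # sum of squared weights
    else:
        penalty = sum(np.sum(np.abs(W)) for W in weights)   # sum of absolute weights (L1)
    return loss + (lam / (2 * m)) * penalty

W = [np.array([[0.5, -1.0], [2.0, 0.1]])]
print(regularized_cost(0.35, W, kind="l2"), regularized_cost(0.35, W, kind="l1"))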
2. Dropout
This is one of the most interesting types of regularization techniques. It also produces very good results and is consequently the most frequently used regularization technique. At every training iteration, it randomly selects some nodes and removes them, along with all of their incoming and outgoing connections.
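A minimal sketch of (inverted) dropout applied to a layer's activations, assuming NumPy; the keep probability and the scaling convention are the usual choices, not taken from the source:

import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    # At each iteration, randomly "remove" nodes by zeroing their activations,
    # which removes their incoming/outgoing contributions for this pass.
    if not training:
        return activations                      # dropout is disabled at test time
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob       # inverted dropout keeps the expected value

h = np.ones((4, 5))
print(dropout(h))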
• Deep Reinforcement Learning
https://en.wikipedia.org/wiki/Deep_reinforcement_learning
• Gradient Descent
https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/
• Multilayer Perceptron
https://en.wikipedia.org/wiki/Multilayer_perceptron
• Module No. 3
https://www.analyticsvidhya.com/blog/2020/10/what-is-the-convolutional-neural-network-architecture/
Padding
When applying convolutions, the output dimensions are not the same as the input dimensions and we lose data over the borders, so we append a border of zeros around the input and recalculate the convolution so that it covers all of the input values.
https://editor.analyticsvidhya.com/uploads/99433dnn4.gif
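A tiny NumPy sketch of the border of zeros described above; the sizes are only illustrative:

import numpy as np

image = np.arange(9).reshape(3, 3)        # a 3x3 "image"
padded = np.pad(image, pad_width=1)       # append a border of zeros -> 5x5
print(padded)
# With a 3x3 filter, the padded input now yields a 3x3 output, matching the input size.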
Striding
Sometimes we do not want to capture all of the data or information available, so we skip some neighboring cells by moving the filter more than one cell at a time; this step size is called the stride.
https://editor.analyticsvidhya.com/uploads/21732Screenshot
(167).png
Pooling
In general terms, pooling refers to a small portion, so here we take a small portion of the input and either take the average value, referred to as average pooling, or take the maximum value, termed max pooling. By pooling an image we are not keeping all of the values; we are taking a summarized value over all of the values present.
https://editor.analyticsvidhya.com/uploads/54575dnn6.png
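A small NumPy sketch of max and average pooling with a stride of 2 (skipping neighboring cells, as in the striding note above); the window size and stride are illustrative assumptions:

import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            # Summarize the window by its maximum (max pooling) or mean (average pooling).
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(16).reshape(4, 4)
print(pool2d(x, mode="max"))
print(pool2d(x, mode="avg"))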
Activation function
The activation function is a node that is placed at the end of or in between neural networks. It helps to decide whether the neuron should fire or not.
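A quick sketch of two common activation functions (ReLU and sigmoid) deciding how strongly a neuron fires; NumPy assumed:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # passes positive values, blocks negative ones

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes any value into (0, 1)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z))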
https://www.analyticsvidhya.com/blog/2021/05/convolutional-neural-networks-cnn/
• https://victorzhou.com/blog/intro-to-cnns-part-2/
• Autoencoders
https://www.jeremyjordan.me/autoencoders/
An autoencoder is a neural network that is trained to attempt to
copy its input to its output. Internally, it has a hidden layer h that
describes a code used to represent the input. The network may
be viewed as consisting of two parts: an encoder function h = f(x)
and a decoder that produces a reconstruction r = g(h).
OR
Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, we design a network architecture that imposes a bottleneck in the network, which forces a compressed knowledge representation of the original input.
OR
An autoencoder is a neural network architecture capable of
discovering structure within data in order to develop a
compressed representation of the input
https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-06-at-3.17.13-PM.png
Undercomplete autoencoder
One way to obtain useful features from the autoencoder is to constrain h to have a smaller dimension than x. An autoencoder whose code dimension is less than the input dimension is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.
The learning process is described simply as minimizing a loss function
L(x, g(f(x)))
where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.
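A minimal sketch of an undercomplete autoencoder minimizing L(x, g(f(x))); PyTorch is assumed here (not mandated by the source), with the code dimension smaller than the input dimension:

import torch
import torch.nn as nn

input_dim, code_dim = 784, 32            # code dimension < input dimension (undercomplete)
encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())     # h = f(x)
decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())  # r = g(h)
loss_fn = nn.MSELoss()                   # L(x, g(f(x)))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, input_dim)            # a dummy batch standing in for real data
for _ in range(10):
    r = decoder(encoder(x))              # reconstruction
    loss = loss_fn(r, x)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())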
• Regularized Autoencoders
Undercomplete autoencoders, with code dimension less than the
input dimension, can learn the most salient features of the data
distribution. We have seen that these autoencoders fail to learn
anything useful if the encoder and decoder are given too much
capacity.
A similar problem occurs if the hidden code is allowed to have
dimension equal to the input, and in the overcomplete case in
which the hidden code has dimension greater than the input. In
these cases, even a linear encoder and linear decoder can learn to
copy the input to the output without learning anything useful
about the data distribution.
Ideally, one could train any architecture of autoencoder
successfully, choosing the code dimension and the capacity of the
encoder and decoder based on the complexity of the distribution to
be modeled. Regularized autoencoders provide the ability to do
so. Rather than limiting the model capacity by keeping the
encoder and decoder shallow and the code size small, regularized
autoencoders use a loss function that encourages the model to
have other properties besides the ability to copy its input to its
output. These other properties include sparsity of the
representation, smallness of the derivative of the representation,
and robustness to noise or to missing inputs. A regularized
autoencoder can be nonlinear and overcomplete but still learn
something useful about the data distribution even if the model
capacity is great enough to learn a trivial identity function.
• Denoising Autoencoders
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output. The DAE training procedure introduces a corruption process C(x̃ | x), which represents a conditional distribution over corrupted samples x̃ given a data sample x.
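A small sketch of the DAE training idea: corrupt x with a simple noise process standing in for C(x̃ | x) (Gaussian noise is an assumption here) and train the network to reconstruct the clean x; PyTorch assumed:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))  # encoder + decoder
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(32, 784)                       # clean data points
for _ in range(10):
    x_tilde = x + 0.3 * torch.randn_like(x)   # corrupted input drawn from C(x_tilde | x)
    loss = loss_fn(net(x_tilde), x)           # predict the original, uncorrupted x
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())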
https://www.i2tutorials.com/explain-about-the-contractive-autoencoders/
• Applications of Autoencoders
Book page 542
• Formula for calculating the current (hidden) state:
ht = tanh( whh · ht-1 + wxh · xt )
where:
whh -> weight at recurrent neuron
wxh -> weight at input neuron
• Formula for calculating output:
Yt = Why · ht
where:
Yt -> output
Why -> weight at output layer
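A minimal NumPy sketch of one recurrent step using the formulas above (whh, wxh, Why); the dimensions and weight values are illustrative:

import numpy as np

hidden, inputs, outputs = 4, 3, 2
rng = np.random.default_rng(0)
whh = rng.normal(size=(hidden, hidden))   # weight at recurrent neuron
wxh = rng.normal(size=(hidden, inputs))   # weight at input neuron
why = rng.normal(size=(outputs, hidden))  # weight at output layer

def rnn_step(x_t, h_prev):
    h_t = np.tanh(whh @ h_prev + wxh @ x_t)   # current state from previous state and input
    y_t = why @ h_t                           # output at time t
    return h_t, y_t

h = np.zeros(hidden)
for x_t in np.ones((5, inputs)):              # a toy sequence of 5 identical inputs
    h, y = rnn_step(x_t, h)
print(y)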
• Unfolded RNNs
Book 391
A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss. Here we use the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.
For example, consider the classical form of a dynamical
system:
s(t) = f(s(t-1); θ)
where s(t) is called the state of the system.
This equation is recurrent because the definition of s at time
t refers back to the same definition at time t - 1.
For a finite number of time steps t , the graph can be
unfolded by applying the definition t - 1 times. For example,
if we unfold the equation for t = 3 time steps, we obtain
s(3) = f(s(2); θ)
     = f(f(s(1); θ); θ)
Unfolding the equation by repeatedly applying the definition
in this way has yielded an expression that does not involve
recurrence. Such an expression can now be represented by
a traditional directed acyclic computational graph.
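A tiny sketch of unfolding the recurrence s(t) = f(s(t−1); θ) for a fixed number of steps; the choice of f and θ below is arbitrary and only for illustration:

def f(s, theta):
    # An arbitrary state-transition function standing in for the dynamical system.
    return theta * s + 1.0

def unfold(s0, theta, steps):
    # Applying the definition 'steps' times removes the recurrence:
    # e.g. s(3) = f(f(s(1); theta); theta), a plain acyclic chain of computations.
    s = s0
    for _ in range(steps):
        s = f(s, theta)   # the same function and the same parameter theta are shared
    return s

print(unfold(s0=0.0, theta=0.5, steps=3))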
• Bidirectional RNNs
Book 411
https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks
Architecture
The principle of BRNNs is to split the neurons of a regular RNN into two directions, one for the positive time direction (forward states) and another for the negative time direction (backward states). The outputs of those two states are not connected to the inputs of the opposite-direction states. The general structure of an RNN and a BRNN is illustrated in the diagram in the linked article. By using two time directions, input information from both the past and the future of the current time frame can be used, unlike in a standard RNN, which requires delays for including future information.
Training
BRNNs can be trained using similar algorithms to RNNs,
because the two directional neurons do not have any
interactions. However, when back-propagation through time
is applied, additional processes are needed because
updating input and output layers cannot be done at once.
General procedures for training are as follows: For forward
pass, forward states and backward states are passed first,
then output neurons are passed. For backward pass, output
neurons are passed first, then forward states and backward
states are passed next. After forward and backward passes
are done, the weights are updated.
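A short sketch of a bidirectional RNN layer; PyTorch's nn.RNN with bidirectional=True is used here as one readily available implementation (an assumption, not the source's code):

import torch
import torch.nn as nn

# Forward states and backward states, each of size 16; their outputs are concatenated.
brnn = nn.RNN(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

x = torch.randn(4, 10, 8)          # batch of 4 sequences, 10 time steps, 8 features
out, h_n = brnn(x)
print(out.shape)                   # (4, 10, 32): 16 forward + 16 backward states per step
print(h_n.shape)                   # (2, 4, 16): final forward and backward hidden states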
• Encoder-Decoder Sequence-to-Sequence Architectures
An encoder (or reader) RNN processes the input sequence and a decoder (or writer) RNN generates the output sequence; the two RNNs are trained jointly to maximize the average of log P(y | x) over all the pairs of x and y sequences in the training set. The last state hn of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN. If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN.
There are at least two ways for a vector-to-sequence RNN to receive input. The input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.
There is no constraint that the encoder must have the same size of hidden layer as the decoder.
One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence.
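A compact sketch of the encoder-decoder idea: the encoder RNN's last state becomes the context C, which is fed to the decoder as its initial state (one of the two ways mentioned above). PyTorch is assumed; sizes are illustrative, and the encoder and decoder hidden sizes are kept equal here only for simplicity:

import torch
import torch.nn as nn

enc = nn.GRU(input_size=8, hidden_size=32, batch_first=True)   # encoder (reader) RNN
dec = nn.GRU(input_size=8, hidden_size=32, batch_first=True)   # decoder (writer) RNN

x = torch.randn(2, 12, 8)           # input sequences
_, C = enc(x)                       # last encoder state used as the context C
y_in = torch.randn(2, 7, 8)         # decoder inputs (e.g. shifted target sequence)
dec_out, _ = dec(y_in, C)           # context provided as the decoder's initial state
print(dec_out.shape)                # (2, 7, 32)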
• Echo State Networks
The basic idea of ESNs is shared with Liquid State Machines (LSMs). Increasingly often, LSMs, ESNs and the more recently explored Backpropagation Decorrelation learning rule for RNNs are subsumed under the name of Reservoir Computing.
• Long Short-Term Memory (LSTM)
The forget gate of LSTM cell i at time step t is a sigmoid unit:
f_i(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j(t) + Σ_j W_{i,j}^f h_j(t-1) )
where x(t) is the current input vector and h(t) is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight f_i(t):
s_i(t) = f_i(t) s_i(t-1) + g_i(t) σ( b_i + Σ_j U_{i,j} x_j(t) + Σ_j W_{i,j} h_j(t-1) )
where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell, and the external input gate unit g_i(t) is computed similarly to the forget gate but with its own parameters:
g_i(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j(t) + Σ_j W_{i,j}^g h_j(t-1) )
The output h_i(t) of the LSTM cell can also be shut off, via the output gate q_i(t), which also uses a sigmoid unit for gating:
h_i(t) = tanh( s_i(t) ) q_i(t)
q_i(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j(t) + Σ_j W_{i,j}^o h_j(t-1) )
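A NumPy sketch of a single LSTM cell step following the gate equations above (forget gate f, external input gate g, output gate q); the weights are random placeholders:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 5, 3                                  # number of cells, input size
rng = np.random.default_rng(0)
p = {k: (rng.normal(size=(n, m)), rng.normal(size=(n, n)), np.zeros(n))
     for k in ("f", "g", "q", "cell")}       # (U, W, b) for each gate and the cell input

def lstm_step(x, h_prev, s_prev):
    f = sigma(p["f"][2] + p["f"][0] @ x + p["f"][1] @ h_prev)       # forget gate
    g = sigma(p["g"][2] + p["g"][0] @ x + p["g"][1] @ h_prev)       # external input gate
    q = sigma(p["q"][2] + p["q"][0] @ x + p["q"][1] @ h_prev)       # output gate
    s = f * s_prev + g * sigma(p["cell"][2] + p["cell"][0] @ x + p["cell"][1] @ h_prev)
    h = np.tanh(s) * q                        # cell output, possibly shut off by the gate
    return h, s

h, s = np.zeros(n), np.zeros(n)
h, s = lstm_step(np.ones(m), h, s)
print(h)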
• Gated Recurrent Units (GRU)
The main difference from the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following (written here in their commonly used form):
u(t) = σ( b^u + U^u x(t) + W^u h(t-1) )          (update gate)
r(t) = σ( b^r + U^r x(t) + W^r h(t-1) )          (reset gate)
h(t) = u(t) ⊙ h(t-1) + (1 - u(t)) ⊙ tanh( b + U x(t) + W ( r(t) ⊙ h(t-1) ) )
where ⊙ denotes element-wise multiplication.
The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it or completely ignore it by replacing it with the new "target state" value (towards which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.
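A NumPy sketch of one GRU step in the commonly used form given above, showing the single update gate u and the reset gate r; weights are random placeholders:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 5, 3
rng = np.random.default_rng(0)
U = {k: rng.normal(size=(n, m)) for k in ("u", "r", "h")}   # input weights per gate
W = {k: rng.normal(size=(n, n)) for k in ("u", "r", "h")}   # recurrent weights per gate
b = {k: np.zeros(n) for k in ("u", "r", "h")}

def gru_step(x, h_prev):
    u = sigma(b["u"] + U["u"] @ x + W["u"] @ h_prev)               # update gate
    r = sigma(b["r"] + U["r"] @ x + W["r"] @ h_prev)               # reset gate
    h_cand = np.tanh(b["h"] + U["h"] @ x + W["h"] @ (r * h_prev))  # candidate "target state"
    return u * h_prev + (1.0 - u) * h_cand                         # conditional leaky integration

h = np.zeros(n)
h = gru_step(np.ones(m), h)
print(h)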
▪ Instance Normalization
https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
Layer normalization and instance normalization are very similar to each other, but the difference between them is that instance normalization normalizes across each channel in each training example instead of normalizing across the input features of a training example. Unlike batch normalization, the instance normalization layer is applied at test time as well (due to its non-dependency on the mini-batch).
This technique was originally devised for style transfer; the problem instance normalization tries to address is that the network should be agnostic to the contrast of the original image.
▪ Group Normalization
https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
Group Normalization normalizes over groups of channels for each training example. We can say that Group Norm sits in between Instance Norm and Layer Norm: when we put all the channels into a single group, group normalization becomes layer normalization, and when we put each channel into a different group, it becomes instance normalization. Sᵢ denotes the set of elements over which the mean and standard deviation are computed, and its definition is what distinguishes the different normalization methods.
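A NumPy sketch of group normalization over a batch of feature maps; setting num_groups = C reproduces instance normalization and num_groups = 1 reproduces layer normalization, as described above (ε is the usual small constant, an assumption):

import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # x has shape (N, C, H, W); channels are split into groups per training example.
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)      # statistics over each group S_i
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)

x = np.random.randn(2, 4, 8, 8)
gn = group_norm(x, num_groups=2)     # Group Norm
inorm = group_norm(x, num_groups=4)  # one channel per group -> Instance Norm
lnorm = group_norm(x, num_groups=1)  # all channels in one group -> Layer Norm
print(gn.shape, inorm.shape, lnorm.shape)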
▪ Module No. 5:
Deep Generative Models & Deep Learning architectures
• Boltzmann Machines
• Book 671
• https://en.wikipedia.org/wiki/Boltzmann_machine
https://medium.com/@icecreamlabs/deep-belief-networks-all-you-need-to-know-68aa9a71cc53
• Variational Autoencoder
• Book 713
https://deep-learning-study-note.readthedocs.io/en/latest/Part%203%20(Deep%20Learning%20Research)/20%20Deep%20Generative%20Models/20.10%20Directed%20Generative%20Nets.html
From the website:
The variational autoencoder or VAE is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods. Process of generating samples from the model:
Draw a sample z from the distribution pmodel(z).
Run the sample z through a differentiable generator network g(z).
Sample x from the distribution pmodel(x; g(z)) = pmodel(x | z).
During training, the approximate inference network (or encoder) q(z | x) is used to obtain z, and pmodel(x | z) is then viewed as a decoder network.
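A compact PyTorch sketch of the VAE's encoder/decoder roles and the sampling process described above (draw z, run it through the generator g(z), then decode x); the Gaussian encoder with the reparameterization trick is the usual choice, assumed here rather than taken from the notes:

import torch
import torch.nn as nn

latent, data = 8, 784
enc = nn.Linear(data, 2 * latent)     # encoder q(z | x): outputs mean and log-variance
dec = nn.Linear(latent, data)         # decoder / generator network g(z) for pmodel(x | z)

def encode(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu), mu, logvar  # reparameterized z

def generate(n):
    z = torch.randn(n, latent)        # draw z from the prior pmodel(z)
    return torch.sigmoid(dec(z))      # run z through g(z) to parameterize pmodel(x | z)

x = torch.rand(16, data)
z, mu, logvar = encode(x)             # training-time z comes from the encoder q(z | x)
recon = torch.sigmoid(dec(z))
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)  # KL term of the bound
print(recon.shape, generate(4).shape, kl.item())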
Review of the variational lower bound: introduce q as an arbitrary distribution over the latent variables. The VAE is trained by maximizing the variational lower bound L(q) associated with data point x:
L(q) = E_{z ~ q(z|x)} [ log pmodel(x | z) ] - DKL( q(z | x) || pmodel(z) ) <= log pmodel(x)
Drawbacks:
1. Samples from variational autoencoders trained on images tend to be somewhat blurry.
2. They tend to use only a small subset of the dimensions of z, as if the encoder were not able to transform enough of the local directions in input space to the space where the marginal distribution matches the factorized prior.
The VAE framework is straightforward to extend to a wide range of model architectures. This is a key advantage over Boltzmann machines.
The deep recurrent attention writer (DRAW) uses a recurrent encoder and recurrent decoder combined with an attention mechanism. Variational RNNs generate sequences by using a recurrent encoder and decoder within the VAE framework.
The importance-weighted autoencoder is another extension of the VAE.
Comparison:
VAE:
o The variational autoencoder is defined for arbitrary computational graphs, which makes it applicable to a wider range of probabilistic model families because there is no need to restrict the choice of models to those with tractable mean-field fixed-point equations.
o It has the advantage of increasing a bound on the log-likelihood of the model.
o It learns an inference network for only one problem: inferring z given x.
MP-DBM and other approaches that involve back-propagation through the approximate inference:
o These require an inference procedure such as mean-field fixed-point equations to provide the computational graph.
o They are more heuristic and have little probabilistic interpretation beyond making the results of approximate inference accurate.
o They are able to perform approximate inference over any subset of variables given any other subset of variables, because the mean-field fixed-point equations specify how to share parameters between the computational graphs for all the different problems.
A nice property of the VAE: simultaneously training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture.
• Object Detection
https://en.wikipedia.org/wiki/Object_detection
Algorithms:
• Module No. 6
• Computer Vision:
• Book 469
The most popular standard tasks for deep learning
algorithms are forms of object recognition or optical
character recognition.
Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities, such as recognizing sound waves from the vibrations they induce in objects visible in a video. Most deep learning for computer vision is
used for object recognition or detection of some form,
whether this means reporting which object is present in an
image, annotating an image with bounding boxes around
each object, transcribing a sequence of symbols from an
image, or labeling each pixel in an image with the identity
of the object it belongs to.
Preprocessing
Many application areas require sophisticated preprocessing
because the original input comes in a form that is difficult
for many deep learning architectures to represent. The
images should be standardized so that their pixels all lie in
the same, reasonable range, like [0, 1] or [-1, 1]. Mixing images that lie in [0, 1] with images that lie in [0, 255] will usually result in failure. Many computer vision architectures also require images of a standard size, so images must be cropped or scaled to fit that size. However, even this rescaling is not
always strictly necessary. Some convolutional models
accept variably-sized inputs and dynamically adjust the size
of their pooling regions to keep the output size constant.
Dataset augmentation may be seen as a way of
preprocessing the training set only. Dataset augmentation
is an excellent way to reduce the generalization error of
most computer vision models.
Other kinds of preprocessing are applied to both the train
and the test set with the goal of putting each example into
a more canonical form in order to reduce the amount of
variation that the model needs to account for. Reducing the
amount of variation in the data can both reduce
generalization error and reduce the size of the model
needed to fit the training set.
Preprocessing for a small dataset is usually designed to
remove some kind of variability in the input data. The preprocessing for large datasets is often unnecessary, and it
is best to just let the model learn which kinds of variability
it should become invariant to.
o Dataset Augmentation:
It is easy to improve the generalization of a classifier
by increasing the size of the training set by adding
extra copies of the training examples that have been
modified with transformations that do not change the
class. Object recognition is a classification task that is
especially amenable to this form of dataset
augmentation because the class is invariant to so
many transformations and the input can be easily
transformed with many geometric operations.
classifiers can benefit from random translations,
rotations, and in some cases, flips of the input to
augment the dataset. In specialized computer vision
applications, more advanced transformations are
commonly used for dataset augmentation. These
schemes include random perturbation of the colors in
an image and nonlinear geometric distortions of the
input.
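A short sketch of label-preserving augmentation with random flips, translations, and color perturbations; torchvision.transforms is one common way to do this (an assumption, not prescribed by the text):

import torch
from torchvision import transforms

# Random transformations that do not change the class of the object in the image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # flips (where class-invariant)
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),  # small rotations/translations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # random perturbation of colors
])

img = torch.rand(3, 64, 64)          # a dummy image tensor (C, H, W)
extra_copy = augment(img)            # an extra, modified copy of the training example
print(extra_copy.shape)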
o Contrast Normalization:
Contrast simply refers to the magnitude of the
difference between the bright and the dark pixels in
an image. In the context of deep learning, contrast
usually refers to the standard deviation of the pixels
in an image or region of an image.
Suppose we have an image represented by a tensor X ∈ R^(r × c × 3), with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} the green intensity, and X_{i,j,3} the blue intensity.
Global contrast normalization (GCN) aims to prevent
images from having varying amounts of contrast by
subtracting the mean from each image, then rescaling
it so that the standard deviation across its pixels is
equal to some constant s.
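A NumPy sketch of global contrast normalization as described above: subtract the image mean, then rescale so the pixel standard deviation equals a constant s; the small constants λ and ε guard against the near-zero-contrast case discussed just below, and their values here are only illustrative:

import numpy as np

def global_contrast_normalization(X, s=1.0, lam=10.0, eps=1e-8):
    # X is one image tensor (e.g. H x W x 3). Subtract the mean over all pixels,
    # then rescale so the standard deviation across the pixels equals s.
    X = X - X.mean()
    contrast = np.sqrt(lam + np.mean(X ** 2))   # lam biases the estimate for low-contrast images
    return s * X / max(contrast, eps)           # eps avoids division by zero

img = np.random.rand(32, 32, 3) * 255.0
out = global_contrast_normalization(img)
print(out.mean(), out.std())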
Images with very low but non-zero contrast often
have little information content. Dividing by the true
standard deviation usually accomplishes nothing
more than amplifying sensor noise or compression
artifacts in such cases.
This motivates introducing a small, positive
regularization parameter λ to bias the estimate of the