Deep Learning
Deep Learning
Department of CSE
(Emerging Technologies)
B.TECH(R-20 Regulation)
(IV YEAR – I SEM)
(2023-24)
Deep Learning
(R20A6610)
LECTURE NOTES
On
30.06.2023
Department of Computer Science and Engineering
EMERGING TECHNOLOGIES
Vision
Mission
The department of CSE (Emerging Technologies) is committed to:
To offer the highest professional and academic standards in terms of personal growth and satisfaction.
To make society a hub of emerging technologies and thereby capture opportunities in new-age technologies.
To create a benchmark in the areas of Research, Education and Public Outreach.
To provide students a platform where independent learning and scientific study are encouraged, with emphasis on the latest engineering techniques.
QUALITY POLICY
To pursue continual improvement of the teaching-learning process of Undergraduate and Post Graduate programs in Engineering & Management vigorously.
To provide state-of-the-art infrastructure and expertise to impart quality education and a research environment to students for a complete learning experience.
1. To understand the basic concepts and techniques of Deep Learning and the need for Deep Learning techniques in real-world problems.
2. To understand CNN algorithms and the way to evaluate performance of the CNN architectures.
3. To apply RNN and LSTM to learn, predict and classify real-world problems in the paradigms of Deep Learning.
4. To understand, learn and design GANs for the selected problems.
5. To understand the concept of auto-encoders and enhancing GANs using auto-encoders.
UNIT-I:
INTRODUCTION TO DEEP LEARNING: Historical Trends in Deep Learning, Why DL is Growing, Artificial Neural Network, Non-linear classification example using Neural Networks: XOR/XNOR, Single/Multiple Layer Perceptron, Feed Forward Network, Deep Feed-forward networks, Stochastic Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation, Deep learning frameworks and libraries (e.g., TensorFlow/Keras, PyTorch).
UNIT-II:
CONVOLUTION NEURAL NETWORK (CNN): Introduction to CNNs and their applications in computer vision, CNN basic architecture, Activation functions - sigmoid, tanh, ReLU, Softmax layer, Types of pooling layers, Training of CNN in TensorFlow, various popular CNN architectures: VGG, GoogLeNet, ResNet etc., Dropout, Normalization, Data augmentation.
UNIT-III
RECURRENT NEURAL NETWORK (RNN): Introduction to RNNs and their applications in sequential data analysis, Back propagation through time (BPTT), Vanishing Gradient Problem, Gradient clipping, Long Short-Term Memory (LSTM) Networks, Gated Recurrent Units, Bidirectional LSTMs, Bidirectional RNNs.
UNIT- IV
GENERATIVE ADVERSARIAL NETWORKS (GANs): Generative models, Concept and principles of GANs, Architecture of GANs (generator and discriminator networks), Comparison between discriminative and generative models, Generative Adversarial Networks (GANs), Applications of GANs.
UNIT- V
AUTO-ENCODERS: Auto-encoders, Architecture and components of auto-
encoders (encoder and decoder), Training an auto-encoder for data
compression and reconstruction, Relationship between Autoencoders
and GANs, Hybrid Models: Encoder-Decoder GANs.
TEXT BOOKS:
1. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press.
2. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
3. Satish Kumar, Neural Networks: A Classroom Approach, Tata McGraw-Hill Education, 2004.
REFERENCES:
1. Francois Chollet, Deep Learning with Python, Manning Publications, 2018.
2. Rowel Atienza, Advanced Deep Learning with Keras, Packt Publications, 2018.
COURSE OUTCOMES:
CO1: Understand the basic concepts and techniques of Deep Learning
and the need of Deep Learning techniques in real-world problems.
CO2: Understand CNN algorithms and the way to evaluate performance of the CNN architectures.
CO3: Apply RNN and LSTM to learn, predict and classify the real-world
problems in the paradigms of Deep Learning.
CO4: Understand, learn and design GANs for the selected problems.
CO5: Understand the concept of Auto-encoders and enhancing GANs using
auto-encoders.
UNIT-I:
INTRODUCTION TO DEEP LEARNING: Historical Trends in Deep Learning, Why DL is Growing, Artificial Neural Network, Non-linear classification example using Neural Networks: XOR/XNOR, Single/Multiple Layer Perceptron, Feed Forward Network, Deep Feed-forward networks, Stochastic Gradient-Based learning, Hidden Units, Architecture Design, Back-Propagation, Deep learning frameworks and libraries (e.g., TensorFlow/Keras, PyTorch).
Today, deep learning has become one of the most popular and visible areas of machine learning, due to its success in a variety of applications such as computer vision, speech recognition, and natural language processing.
Artificial neurons, also known as units, are the building blocks of artificial neural networks. The whole Artificial Neural Network is composed of these artificial neurons, which are arranged in a series of layers. Whether a layer has a dozen units or millions of units depends on the complexity of the underlying patterns in the dataset. Commonly, an Artificial Neural Network has an input layer, an output layer as well as hidden layers. The input layer receives data from the outside world which the neural network needs to analyze or learn about.
In a fully connected artificial neural network, there is an input layer and one or more
hidden layers connected one after the other. Each neuron receives input from the
previous layer neurons or the input layer. The output of one neuron becomes the
input to other neurons in the next layer of the network, and this process continues
until the final layer produces the output of the network. Then, after passing through
one or more hidden layers, this data is transformed into valuable data for the
output layer. Finally, the output layer provides an output in the form of an artificial
neural network’s response to the data that comes in.
In most neural networks, units in one layer are linked to units in the next. Each of these links has a weight that controls how much one unit influences another. As data moves from one unit to another, the neural network learns more and more about it, ultimately producing an output from the output layer.
Difference between Machine Learning and Deep Learning:
Machine learning and deep learning both are subsets of artificial intelligence but
there are many similarities and differences between them.
Machine Learning | Deep Learning
Can work on a smaller amount of data. | Requires a larger volume of data compared to machine learning.
Takes less time to train the model. | Takes more time to train the model.
Deep Learning models are able to automatically learn features from the data, which
makes them well-suited for tasks such as image recognition, speech recognition,
and natural language processing. The most widely used architectures in deep
learning are feedforward neural networks, convolutional neural networks (CNNs),
and recurrent neural networks (RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear
flow of information through the network. FNNs have been widely used for tasks
such as image classification, speech recognition, and natural language processing.
Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition tasks. CNNs are able to automatically learn features from images, which makes them well-suited for tasks such as image classification, object detection, and image segmentation.
Recurrent Neural Networks (RNNs) are a type of neural network that is able to
process sequential data, such as time series and natural language. RNNs are able
to maintain an internal state that captures information about the previous inputs,
which makes them well-suited for tasks such as speech recognition, natural
language processing, and language translation.
Computer vision
In computer vision, Deep learning models can enable machines to identify and
understand visual data. Some of the main applications of deep learning in
computer vision include:
Object detection and recognition: Deep learning models can be used to identify and locate objects within images and videos, enabling applications such as self-driving cars, surveillance, and robotics.
Image classification: Deep learning models can be used to classify
images into categories such as animals, plants, and buildings. This is
used in applications such as medical imaging, quality control, and image
retrieval.
Image segmentation: Deep learning models can be used to segment images into different regions, making it possible to identify specific features within images.
Process in ML/DL:
Artificial Neural Networks contain artificial neurons which are called units. These
units are arranged in a series of layers that together constitute the whole Artificial
Neural Network in a system.
A layer can have only a dozen units or millions of units, depending on how complex the network needs to be to learn the hidden patterns in the dataset.
Commonly, Artificial Neural Network has an input layer, an output layer as well as
hidden layers.
The input layer receives data from the outside world which the neural network
needs to analyze or learn about. Then this data passes through one or multiple
hidden layers that transform the input into data that is valuable for the output layer.
Finally, the output layer provides an output in the form of a response of the
Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to
another. Each of these connections has weights that determine the influence of
one unit on another unit. As the data transfers from one unit to another, the neural
network learns more and more about the data which eventually results in an output
from the output layer.
The structures and operations of biological neurons serve as the basis for artificial neural networks, which are also known as neural networks or neural nets. The input layer of an artificial neural network is the first layer; it receives input from external sources and passes it to the hidden layer, which is the second layer. In the hidden layer, each neuron receives input from the previous layer's neurons, computes the weighted sum, and sends it to the neurons in the next layer. These connections are weighted: the effect of each input from the previous layer is scaled up or down by assigning a different weight to each input, and these weights are adjusted during the training process to improve model performance.
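As a concrete illustration of the weighted-sum-and-activation behaviour described above, here is a minimal Python (NumPy) sketch of a single artificial neuron. The input values, weights, bias, and the choice of a sigmoid activation are illustrative assumptions, not values taken from these notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs arriving from the previous layer, one weight per connection.
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.8, 0.2, -0.4])   # adjusted during training
bias = 0.1

weighted_sum = np.dot(inputs, weights) + bias   # weighted sum of the inputs
output = sigmoid(weighted_sum)                  # activation passed on to the next layer
print(output)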
Artificial neurons vs Biological neurons
The concept of artificial neural networks comes from biological neurons found in animal brains, so they share a lot of similarities in structure and function.
Structure: The structure of artificial neural networks is inspired by
biological neurons. A biological neuron has a cell body or soma to
process the impulses, dendrites to receive them, and an axon that
transfers them to other neurons. The input nodes of artificial neural
networks receive input signals, the hidden layer nodes compute these
input signals, and the output layer nodes compute the final output by
processing the hidden layer’s results using activation functions.
Biological Neuron | Artificial Neuron
Dendrites | Inputs
Synapses | Weights
Axon | Output
Synapses: Synapses are the links between biological neurons that enable the transmission of impulses from dendrites to the cell body. In artificial neurons, synapses correspond to the weights that join the nodes of one layer to the nodes of the next layer. The strength of a link is determined by its weight value.
Learning: In biological neurons, learning happens in the cell body nucleus
or soma, which has a nucleus that helps to process the impulses. An
action potential is produced and travels through the axons if the
impulses are powerful enough to reach the threshold. This becomes
possible by synaptic plasticity, which represents the ability of
synapses to become stronger or weaker over time in reaction to changes
in their activity. In artificial neural networks, backpropagation is a
technique used for learning, which adjusts the weights between nodes
according to the error or differences between predicted and actual
outcomes.
Artificial neural networks are trained using a training set. For example, suppose you
want to teach an ANN to recognize a cat. Then it is shown thousands of different
images of cats so that the network can learn to identify a cat. Once the neural
network has been trained enough using images of cats, then you need to check if it
can identify cat images correctly. This is done by making the ANN classify the
images it is provided by deciding whether they are cat images or not. The output
obtained by the ANN is corroborated by a human-provided description of whether
the image is a cat image or not. If the ANN identifies incorrectly then back-
propagation is used to adjust whatever it has learned during
training. Backpropagation is done by fine-tuning the weights of the connections in
ANN units based on the error rate obtained. This process continues until the
artificial neural network can correctly recognize a cat in an image with minimal
possible error rates.
1. Social Media: Artificial Neural Networks are used heavily in Social Media.
For example, let’s take the ‘People you may know’ feature on Facebook
that suggests people that you might know in real life so that you can
send them friend requests. Well, this magical effect is achieved by using
Artificial Neural Networks that analyze your profile, your interests, your
current friends, and also their friends and various other factors to
calculate the people you might potentially know. Another common
application of Machine Learning in social media is facial recognition.
This is done by finding around 100 reference points on the person’s face
and then matching them with those already available in the database
using convolutional neural networks.
2. Marketing and Sales: When you log onto E-commerce sites like Amazon
and Flipkart, they will recommend your products to buy based on your
previous browsing history. Similarly, suppose you love Pasta, then
Zomato, Swiggy, etc. will show you restaurant recommendations based
on your tastes and previous order history. This is true across all new-age
marketing segments like Book sites, Movie services, Hospitality sites, etc.
and it is done by implementing personalized marketing. This uses
Artificial Neural Networks to identify the customer likes, dislikes,
previous shopping history, etc., and then tailor the marketing campaigns
accordingly.
3. Healthcare: Artificial Neural Networks are used in Oncology to train algorithms that can identify cancerous tissue at the microscopic level with the same accuracy as trained physicians.
Non-linear classification example using Neural Networks: XOR
The truth table of the XOR operation is:
X | Y | Output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
Output = X·Y′ + X′·Y
The XOR gate can be built as a combination of AND, OR and NOT gates, and this type of logic finds vast application in cryptography and fault tolerance. The logical diagram of an XOR gate is shown below.
Here we can see that the pink dots and the red triangle points in the plot do not overlap, and a straight line easily separates the two classes: the region above the line can be considered one class and the region below it the other.
The XOR problem with neural networks can be solved by using Multi-Layer
Perceptrons or a neural network architecture with an input layer, hidden layer, and
output layer. So during the forward propagation through the neural networks, the
weights get updated to the corresponding layers and the XOR logic gets executed.
The Neural network architecture to solve the XOR problem will be as shown below.
With appropriate weights assigned to each layer, the XOR logic output can be obtained through forward propagation. The overall neural network architecture uses the ReLU activation function, so the output of a particular neuron is positive when its weighted input is positive and 0 when its weighted input is negative. Let us work through the output for the first input state.
Example: For X1 = 0 and X2 = 0 we should get an output of 0. Let us verify it.
H1 = ReLU(0·1 + 0·1 + 0) = 0
H2 = ReLU(0·1 + 0·1 + 0) = 0
We have now obtained the activations that propagate from the input layer to the hidden layer. Next we propagate from the hidden layer to the output layer:
Y = ReLU(0·1 + 0·(−2)) = 0
So, among the various logical operations, the XOR operation is one problem where the data points are not linearly separable using a single neuron or perceptron. To solve the XOR problem it is therefore necessary to use multiple neurons in the neural network architecture, with suitable weights and appropriate activation functions, as in the sketch below.
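To make the multi-neuron solution concrete, the following NumPy sketch hand-codes a 2-2-1 ReLU network that realises XOR. The specific weights and biases used here (hidden biases 0 and -1, output weights 1 and -2) are one commonly used hand-crafted assignment and may differ from the weights shown in the architecture figure above.

import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-crafted (assumed) weights for a 2-2-1 ReLU network computing XOR.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([0.0, -1.0])
w_output = np.array([1.0, -2.0])

def xor_net(x1, x2):
    h = relu(W_hidden @ np.array([x1, x2]) + b_hidden)   # hidden-layer activations
    return relu(w_output @ h)                            # network output

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0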
A perceptron uses a step function that returns +1 if the weighted sum of its inputs is greater than or equal to 0, and -1 otherwise.
The activation function is used to map the input between the required value like (0, 1)
or (-1, 1).
a. In the first step, all the inputs x are multiplied with their weights w.
b. In the second step, add all the multiplied values together; this is called the weighted sum.
c. In the last step, apply the weighted sum to the correct activation function.
For example:
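Consider a single perceptron with three inputs. The following Python sketch walks through the three steps listed above; the input values and weights are illustrative assumptions, and a step activation returning +1 or -1 is used.

import numpy as np

def step(z):
    return 1 if z >= 0 else -1        # step activation: +1 for a non-negative sum, -1 otherwise

x = np.array([1.0, 0.0, 1.0])         # inputs (assumed values)
w = np.array([0.4, -0.6, 0.2])        # weights (assumed values)

products = x * w                      # step a: multiply every input by its weight
weighted_sum = products.sum()         # step b: add the products to get the weighted sum
output = step(weighted_sum)           # step c: apply the activation function
print(output)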
There are two types of architecture. These types focus on the functionality of
artificial neural networks as follows-
o Single Layer Perceptron
o Multi-Layer Perceptron
Single Layer Perceptron
This was the first neural model proposed. The content of the neuron's local memory contains a vector of weights.
The output of a single-layer perceptron is calculated by summing each component of the input vector multiplied by the corresponding component of the weight vector. The resulting weighted sum is then fed to an activation function, which produces the displayed output.
o The weights are initialized with random values at the start of training.
o For each element of the training set, the error is calculated as the difference between the desired output and the actual output. The calculated error is used to adjust the weights.
o The process is repeated until the error made on the entire training set is less than the specified limit, or until the maximum number of iterations has been reached.
Multi-Layer Perceptron
In the multi-layer perceptron diagram above, we can see that there are three inputs and thus three input nodes, and the hidden layer has three nodes. The output layer gives two outputs, therefore there are two output nodes. The nodes in the input layer take input and forward it for further processing; in the diagram, the nodes in the input layer forward their output to each of the three nodes in the hidden layer, and in the same way, the hidden layer processes the information and passes it to the output layer.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula.
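For reference, the sigmoid formula mentioned above is the standard logistic function, which maps any real input x to a value between 0 and 1:

sigmoid(x) = 1 / (1 + e^(−x))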
Machine learning models are built on assumptions such as the one where X and Y are
related. An Inductive Bias of linear regression is the linear relationship between X and Y. In
this way, a line or hyperplane gets fitted to the data.
When X and Y have a complex relationship, it can get difficult for a linear regression method to predict Y. In this situation, a more flexible, higher-dimensional curve is needed to approximate the relationship.
A manual adjustment is sometimes needed based on the complexity of the function and the number of layers within the network. In most cases, trial-and-error methods combined with experience are used to accomplish this. This is why these parameters are called hyperparameters.
During data flow, input nodes receive data, which travels through the hidden layers and exits through the output nodes. There are no links in the network that send information back from the output nodes.
Feed forward neural networks serve as the basis for object detection in photos, as shown in
the Google Photos app.
When the feed forward neural network gets simplified, it can appear as a single layer
perceptron.
This model multiplies inputs with weights as they enter the layer. Afterward, the weighted
input values get added together to get the sum. As long as the sum of the values rises above
a certain threshold, set at zero, the output value is usually 1, while if it falls below the
threshold, it is usually -1.
As a feed forward neural network model, the single-layer perceptron often gets used for
classification. Machine learning can also get integrated into single-layer perceptrons.
Through training, neural networks can adjust their weights based on a property called the
delta rule, which helps them compare their outputs with the intended values.
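A minimal Python sketch of this idea, using the perceptron (delta) learning rule to adjust the weights from the difference between the intended and predicted outputs. The toy dataset, the zero initialisation, and the learning rate are illustrative assumptions.

import numpy as np

# Toy linearly separable data: target is +1 only when both inputs are 1 (assumed example).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate (assumed)

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b >= 0 else -1   # single-layer perceptron output
        error = target - pred                        # compare output with the intended value
        w += lr * error * xi                         # adjust weights in proportion to the error
        b += lr * error

print(w, b)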
Input layer:
The neurons of this layer receive input and pass it on to the other layers of the
network. Feature or attribute numbers in the dataset must match the number of
neurons in the input layer.
Output layer:
According to the type of model getting built, this layer represents the forecasted
feature.
Hidden layer:
Input and output layers get separated by hidden layers. Depending on the type of
model, there may be several hidden layers.
There are several neurons in hidden layers that transform the input before actually
transferring it to the next layer. This network gets constantly updated with weights in
order to make it easier to predict.
Neuron weights:
Weights represent the strength of the connection between two neurons; a weight decides how much influence a given input will have on the neuron's output.
Neurons:
Feed forward networks use artificial neurons, which are adapted from biological neurons. A neural network consists of these artificial neurons.
Neurons function in two ways: first, they create weighted input sums, and second,
they activate the sums to make them normal.
Activation functions can either be linear or nonlinear. Neurons have weights based
on their inputs. During the learning phase, the network studies these weights.
Activation Function:
According to the activation function, the neurons determine whether to make a linear or nonlinear decision. Because the data passes through many layers, the activation function prevents the cascading effect from causing neuron outputs to grow without bound.
An activation function can be classified into three major categories: sigmoid, Tanh,
and Rectified Linear Unit (ReLu).
Sigmoid:
This function maps the input to values between 0 and 1.
Tanh:
This function maps the input to values between -1 and 1.
Rectified Linear Unit (ReLU):
Only positive values are allowed to flow through this function. Negative values get mapped to 0.
Cost function
In a feed forward neural network, the cost function plays an important role. Minor adjustments to weights and biases have little effect on the number of correctly categorized data points, so a smooth cost function is used to determine how to adjust the weights and biases to improve performance.
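The cost function referred to here is commonly the quadratic (mean squared error) cost; in the legend below, w additionally denotes the weights and n the total number of training inputs:

C(w, b) = (1 / 2n) Σx || y(x) − a ||²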
Where,
b = biases
a = output vectors
x = input
Loss function
The loss function of a neural network is used to determine whether the learning process needs to be adjusted. The number of neurons in the output layer equals the number of classes, and the loss measures the difference between the predicted and actual probability distributions. The following is the cross-entropy loss for binary classification.
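In its standard form, for a true label y ∈ {0, 1} and a predicted probability p, the binary cross-entropy loss is:

L = −[ y·log(p) + (1 − y)·log(1 − p) ]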
In the gradient descent algorithm, the next point is calculated by scaling the gradient at the current position by a learning rate and then subtracting the obtained value from the current position.
To decrease the function, it subtracts the value (to increase, it would add). As an
example, here is how to write this procedure:
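In standard notation, with p_n the current position, η the learning rate, and ∇f(p_n) the gradient at that position, the update is:

p_(n+1) = p_n − η ∇f(p_n)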
The gradient gets adjusted by the parameter η, which also determines the step size.
Output units
In the output layer, output units are those units that provide the desired output or
prediction, thereby fulfilling the task that the neural network needs to complete.
There is a close relationship between the choice of output units and the cost
function. Any unit that can serve as a hidden unit can also serve as an output unit in
a neural network.
There are many applications for these neural networks. The following are a few of them.
Physiological feed forward system: feed forward management can be identified in this situation because the central involuntary (autonomic) system regulates the heartbeat in anticipation of exercise.
Parallel feed forward compensation with derivative: an open-loop transfer function of a non-minimum phase system is converted into a minimum phase system using this technique.
Typical deep learning algorithms are neural networks (NNs). As a result of their unique
structure, their popularity results from their 'deep' understanding of data.
Furthermore, NNs are flexible in terms of complexity and structure. Despite all the advanced variants, they cannot work without the basic elements: they may work better with the advanced additions, but the underlying structure remains the same.
NNs get constructed similarly to our biological neurons, and they resemble the
following:
Neurons are hexagons in this image. In neural networks, neurons get arranged into
layers: input is the first layer, and output is the last with the hidden layer in the middle.
In the third step, a vector of ones gets multiplied by the output of our hidden layer:
Using the output value, we can calculate the result. Understanding these fundamental
concepts will make building NN much easier, and you will be amazed at how quickly you can
do it. Every layer's output becomes the following layer's input.
In a network, the architecture refers to the number of hidden layers and units in each layer
that make up the network.
According to the Universal Approximation Theorem, a feed forward network with a linear output layer and at least one hidden layer with a "squashing" activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided there are enough hidden units.
It simply states that we can always represent any function using the multi-layer
perceptron (MLP), regardless of what function we try to learn.
Thus, we now know there will always be an MLP to solve our problem, but there is no
specific method for finding it.
It is impossible to say whether it will be possible to solve the given problem if we use
N layers with M hidden units.
Research is still ongoing, and for now, the only way to determine this configuration is
by experimenting with it.
There are two possible explanations for this. Firstly, the optimization algorithm may
not find the correct parameters, and secondly, the training algorithms may use the
wrong function because of overfitting.
The goal when learning a neural network is to reduce the cost function given the training data. The cost function is determined by the weights and biases of all neurons in every layer. Backpropagation is used to iteratively calculate the gradient of the cost function, and the weights and biases are then updated in the direction opposite to the gradient to reduce the cost.
We first define the error δ[l]i of the i-th neuron in the l-th layer of the network for the j-th training example as the derivative of the loss with respect to that neuron's weighted input z[l]i. In the formulas below, C stands for the loss, L for the output layer, g for the activation function, ∇a for the gradient with respect to the activations, W[l]T for the transposed weight matrix of layer l, b[l]i for the bias of neuron i in layer l, w[l]ik for the weight from neuron k in layer l−1 to neuron i in layer l, a[l−1]k for the activation of neuron k in layer l−1, and ⊙ for element-wise multiplication. The four backpropagation equations are:
δ[L] = ∇a C ⊙ g′(z[L])
δ[l] = (W[l+1]T δ[l+1]) ⊙ g′(z[l])
∂C/∂b[l]i = δ[l]i
∂C/∂w[l]ik = a[l−1]k · δ[l]i
The first equation shows how to calculate the error at the output layer for sample j.
Following that, we can use the second equation to calculate the error in the layer just
before the output layer.
Based on the error values for the next layer, the second equation can calculate the
error in any layer. Because this algorithm calculates errors backward, it is known as
backpropagation.
For sample j, we calculate the gradient of the loss function with respect to each bias and weight using the third and fourth equations.
We can then update the biases and weights by averaging the gradients of the loss function with respect to the biases and weights over all samples and taking a step along these average gradients. This process is known as batch gradient descent. If we have too many samples, we will have to wait a long time for each update. Stochastic gradient descent instead estimates the gradient from a single sample (or a small mini-batch) at a time. Even though this algorithm is faster than batch gradient descent, the gradient calculated from a single sample is a noisier estimate of the true gradient.
In neural networks, a hidden layer is located between the input and output
of the algorithm, in which the function applies weights to the inputs and
directs them through an activation function as the output. In short, the
hidden layers perform nonlinear transformations of the inputs entered into
the network. Hidden layers vary depending on the function of the neural
network, and similarly, the layers may vary depending on their associated
weights.
Hidden layers allow the function of a neural network to be broken down into specific transformations of the data. Each hidden layer function is specialized to produce a defined output. For example, hidden layer functions that are used to identify human eyes and ears may be used in conjunction by subsequent layers to identify faces in images. While the functions that identify eyes alone are not enough to independently recognize objects, they can function jointly within a neural network.
Hidden Layers and Machine Learning:
Hidden layers are very common in neural networks, however their use and
architecture often varies from case to case. As referenced above, hidden
layers can be separated by their functional characteristics. For example, in
a CNN used for object recognition, a hidden layer that is used to identify
wheels cannot solely identify a car, however when placed in conjunction
with additional layers used to identify windows, a large metallic body, and
headlights, the neural network can then make predictions and identify
possible cars within visual data.
1. Well if the data is linearly separable then you don't need any hidden
layers at all.
It should be kept in mind that increasing hidden layers would also increase the
complexity of the model and choosing hidden layers such as 8, 9, or in two
digits may sometimes lead to overfitting.
These above algorithms are only a general use case and they can be moulded
according to use case. Sometimes the number of nodes in hidden layers can
increase also in subsequent layers and the number of hidden layers can also
be more than the ideal case.
This whole depends upon the use case and problem statement that we are
dealing with.
Architecture Design:
Types of neural networks models are listed below:
Perceptron
Feed Forward Neural Network
Multilayer Perceptron
Convolutional Neural Network
Radial Basis Functional Neural Network
Recurrent Neural Network
LSTM – Long Short-Term Memory
Sequence to Sequence Models
Modular Neural Network
Artificial neural networks are inspired by the biological neurons within the human
body which activate under certain circumstances resulting in a related action
performed by the body in response. Artificial neural nets consist of various layers of
interconnected artificial neurons powered by activation functions that help in
switching them ON/OFF. Like traditional machine learning algorithms, here too, there are certain values that neural nets learn in the training phase.
Briefly, each neuron receives a multiplied version of inputs and random weights,
which is then added with a static bias value (unique to each neuron layer); this is
then passed to an appropriate activation function which decides the final value to be
given out of the neuron. There are various activation functions available as per the
nature of input values. Once the output is generated from the final neural net layer,
loss function (input vs output) is calculated, and backpropagation is performed
where the weights are adjusted to make the loss minimum. Finding optimal values of
weights is what the overall operation focuses around. Please refer to the following
for better understanding-
Weights are numeric values that are multiplied by inputs. In backpropagation, they are modified to reduce the loss. In simple words, weights are machine-learned values in neural networks; they self-adjust depending on the difference between the predicted outputs and the training targets.
Activation Function is a mathematical formula that helps the neuron to switch
ON/OFF.
The main features of backpropagation are the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.
Input values
X1=0.05
X2=0.10
Initial weight
W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the value of H1 we multiply the input values by their weights and add the bias:
H1 = x1×w1 + x2×w2 + b1
H1 = 0.05×0.15 + 0.10×0.20 + 0.35
H1 = 0.3775
H2 = x1×w3 + x2×w4 + b1
H2 = 0.05×0.25 + 0.10×0.30 + 0.35
H2 = 0.3925
Applying the sigmoid (logistic) activation to these net inputs gives the hidden-layer outputs:
out_H1 = σ(0.3775) ≈ 0.593269992
out_H2 = σ(0.3925) ≈ 0.596884378
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2. To find the net input y1, we multiply the hidden-layer outputs out_H1 and out_H2 by their weights:
y1 = out_H1×w5 + out_H2×w6 + b2
y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60
y1 = 1.10590597
y2 = out_H1×w7 + out_H2×w8 + b2
y2 = 0.593269992×0.50 + 0.596884378×0.55 + 0.60
y2 = 1.2249214
Applying the sigmoid activation gives the final outputs:
out_y1 = σ(1.10590597) ≈ 0.75136507
out_y2 = σ(1.2249214) ≈ 0.772928465
Our target values are 0.01 and 0.99, so out_y1 and out_y2 do not match the targets T1 and T2.
Now we find the total error, which is the sum of the squared differences between the outputs and the targets:
E_total = 1/2 (T1 − out_y1)² + 1/2 (T2 − out_y2)²
E_total = 1/2 (0.01 − 0.75136507)² + 1/2 (0.99 − 0.772928465)² ≈ 0.298371109
Now, we will backpropagate this error to update the weights using a backward pass.
To update w5 we need ∂Etotal/∂w5. Etotal does not contain w5 directly, so we apply the chain rule and split the derivative into terms that we can evaluate one by one:
∂Etotal/∂w5 = ∂Etotal/∂out_y1 × ∂out_y1/∂y1 × ∂y1/∂w5
Now, we calculate each term one by one to differentiate Etotal with respect to w5.
Now, we calculate the updated weight w5new using the gradient-descent update rule (with learning rate η):
w5new = w5 − η × ∂Etotal/∂w5
In the same way, we calculate w6new, w7new, and w8new, and this will give us the
following values
w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
Similarly, w1 does not appear directly in the error expression, so we again apply the chain rule and split the derivative into terms that we can differentiate with respect to w1:
∂Etotal/∂w1 = ∂Etotal/∂out_H1 × ∂out_H1/∂H1 × ∂H1/∂w1
Now, we calculate each term one by one to differentiate Etotal with respect to w1.
We again split both terms, because E1 and E2 do not contain y1 and y2 directly.
Now, we find the value of each term by substituting the values into equations (18) and (19).
We calculate the partial derivative of the total net input to H1 with respect to w1 the
same as we did for the output neuron:
Now, we calculate the updated weight w1new with the same update rule:
w1new = w1 − η × ∂Etotal/∂w1
In the same way, we calculate w2new, w3new, and w4new, and this gives us the following values
w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229
We have now updated all the weights. The error on the network was 0.298371109 when we fed forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e., close to our target values, when we feed forward 0.05 and 0.1.
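The worked example above can be reproduced with a short NumPy script. This is a minimal sketch assuming the same 2-2-2 architecture, sigmoid activations on both layers, the squared-error loss, fixed biases, and a learning rate of 0.5; the learning rate is not stated explicitly in the notes, but it is consistent with the updated weights shown above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])            # inputs
t = np.array([0.01, 0.99])            # targets T1, T2
W1 = np.array([[0.15, 0.20],          # w1, w2 (into H1)
               [0.25, 0.30]])         # w3, w4 (into H2)
W2 = np.array([[0.40, 0.45],          # w5, w6 (into y1)
               [0.50, 0.55]])         # w7, w8 (into y2)
b1, b2 = 0.35, 0.60
lr = 0.5                              # assumed learning rate

for step in range(10000):
    # Forward pass
    h_net = W1 @ x + b1               # net inputs H1, H2
    h_out = sigmoid(h_net)            # hidden-layer outputs out_H1, out_H2
    y_net = W2 @ h_out + b2           # net inputs y1, y2
    y_out = sigmoid(y_net)            # network outputs out_y1, out_y2
    E = 0.5 * np.sum((t - y_out) ** 2)    # total squared error

    # Backward pass (chain rule, as derived above); biases are kept fixed as in the example.
    delta_out = (y_out - t) * y_out * (1 - y_out)
    dW2 = np.outer(delta_out, h_out)
    delta_hidden = (W2.T @ delta_out) * h_out * (1 - h_out)
    dW1 = np.outer(delta_hidden, x)

    W2 -= lr * dW2                    # gradient-descent weight updates
    W1 -= lr * dW1

print(E, y_out)   # final error and outputs, close to the targets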
Introduction
Keras
Keras is a high-level neural networks API written in Python. It can run on top of TensorFlow, CNTK or Theano and is designed to enable fast experimentation with deep neural networks.
TensorFlow
TensorFlow is an end-to-end open source machine learning framework developed by Google. It provides both low-level operations and high-level APIs for building and deploying deep learning models.
PyTorch
PyTorch is an open source machine learning library for Python, based on Torch. It is
used for applications such as natural language processing and was developed by
Facebook’s AI research group.
Comparison Factors
All three frameworks are related to each other but also have certain basic differences that distinguish them from one another.
So let's have a look at the parameters that distinguish them:
Level of API
Speed
Architecture
Debugging
Dataset
Popularity
Level of API
Keras is a high-level API capable of running on top of TensorFlow, CNTK and Theano.
It has gained favor for its ease of use and syntactic simplicity, facilitating fast
development.
TensorFlow is a framework that provides both high and low level APIs. Pytorch, on
the other hand, is a lower-level API focused on direct work with array expressions. It
has gained immense interest in the last year, becoming a preferred solution for
academic research, and applications of deep learning requiring optimizing custom
expressions.
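To make the difference in API level concrete, here is a small sketch of the same one-layer classifier written with Keras and with PyTorch; the layer sizes are illustrative assumptions.

# Keras (high-level): the model is assembled from ready-made layers.
from tensorflow import keras

keras_model = keras.Sequential([
    keras.layers.Dense(10, activation="softmax", input_shape=(784,))
])
keras_model.compile(optimizer="sgd", loss="categorical_crossentropy")

# PyTorch (lower-level): the forward computation is written out explicitly.
import torch

class Classifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(784, 10)

    def forward(self, x):
        return torch.softmax(self.linear(x), dim=1)

torch_model = Classifier()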
Speed
Keras is comparatively slower in performance, whereas TensorFlow and PyTorch provide similar, faster performance suited to high-performance models.
Architecture
Keras has a simple architecture; it is more readable and concise. TensorFlow, on the other hand, is not very easy to use on its own, even though it provides Keras as a framework that makes work easier. PyTorch has a more complex architecture, and its readability is less when compared to Keras.
Debugging
In Keras, there is usually very little need to debug simple networks. In the case of TensorFlow, debugging can be quite difficult. PyTorch, on the other hand, has better debugging capabilities than the other two.
Dataset
Keras is usually used for small datasets as it is comparatively slower. On the other hand, TensorFlow and PyTorch are used for high-performance models and large datasets that require fast execution.
Popularity
With the increasing demand in the field of Data Science, there has been an
enormous growth of Deep learning technology in the industry. With this, all the three
frameworks have gained quite a lot of popularity. Keras tops the list followed by
TensorFlow and PyTorch. It has gained immense popularity due to
its simplicity when compared to the other two.
These were the parameters that distinguish the three frameworks, but there is no absolute answer as to which one is better. The choice ultimately comes down to:
Technical background
Requirements
Ease of use
Final Verdict
Now coming to the final verdict of Keras vs TensorFlow vs PyTorch let’s have a look
at the situations that are most preferable for each one of these three deep learning
frameworks
Keras is most suitable for:
Rapid prototyping
Small datasets
Multiple back-end support
TensorFlow is most suitable for:
Large datasets
High performance
Functionality
Object detection
PyTorch is most suitable for:
Flexibility
Short training duration
Debugging capabilities
UNIT-II:
CONVOLUTION NEURAL NETWORK (CNN): Introduction to CNNs and their applications in computer vision, CNN basic architecture, Activation functions - sigmoid, tanh, ReLU, Softmax layer, Types of pooling layers, Training of CNN in TensorFlow, various popular CNN architectures: VGG, GoogLeNet, ResNet etc., Dropout, Normalization, Data augmentation.
Deep Learning has proved to be a very powerful tool because of its ability to handle large amounts of data. Interest in using hidden layers has surpassed traditional techniques, especially in pattern recognition. One of the most popular deep neural networks is the Convolutional Neural Network (also known as CNN or ConvNet), especially when it comes to Computer Vision applications.
Since the 1950s, the early days of AI, researchers have struggled to make a
system that can understand visual data. In the following years, this field
came to be known as Computer Vision. In 2012, computer vision took a
quantum leap when a group of researchers from the University of Toronto
developed an AI model that surpassed the best image recognition
algorithms, and that too by a large margin.
The AI system, which became known as AlexNet (named after its main
creator, Alex Krizhevsky), won the 2012 ImageNet computer vision contest
with an amazing 85 percent accuracy. The runner-up scored a modest 74
percent on the test.
Background of CNNs
CNN’s were first developed and used around the 1980s. The most that a
CNN could do at that time was recognize handwritten digits. It was mostly
used in the postal sectors to read zip codes, pin codes, etc. The important
thing to remember about any deep learning model is that it requires a large
amount of data to train and also requires a lot of computing resources.
This was a major drawback for CNNs at that period and hence CNNs were
only limited to the postal sectors and it failed to enter the world of machine
learning.
In 2012 Alex Krizhevsky realized that it was time to bring back the branch
of deep learning that uses multi-layered neural networks. The availability of
large sets of data, to be more specific ImageNet datasets with millions of
labeled images and an abundance of computing resources enabled
researchers to revive CNNs.
What Is a CNN?
A Convolutional Neural Network (CNN or ConvNet) is a deep learning algorithm designed to take in an image, assign importance (learnable weights) to various features in the image, and distinguish one image from another. The bottom line is that the role of the ConvNet is to reduce the images into a form that is easier to process, without losing features that are critical for getting a good prediction.
Before we go to the working of CNN’s let’s cover the basics such as what is
an image and how it is represented. An RGB image is nothing but a matrix of pixel values with three planes (channels), whereas a grayscale image is the same but with a single plane.
The final layer of the ConvNet outputs a set of probabilities (values between 0 and 1) that specify how likely the image is to belong to each "class." For instance, if you have a ConvNet that detects cats, dogs, and horses, the output of the final layer is the possibility that the input image contains any of those animals.
Max Pooling returns the maximum value from the portion of the image covered by the kernel and also acts as a noise suppressant, discarding the noisy activations. On the other hand, Average Pooling returns the average of all the values from the portion of the image covered by the kernel; it simply performs dimensionality reduction as a noise-suppressing mechanism. Hence, we can say that Max Pooling performs a lot better than Average Pooling.
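A minimal NumPy sketch of 2×2 max pooling and average pooling with stride 2 over a single-channel feature map; the input values are illustrative.

import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 5, 6]], dtype=float)

def pool2x2(fmap, mode="max"):
    h, w = fmap.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = fmap[i:i + 2, j:j + 2]   # region covered by the kernel
            out[i // 2, j // 2] = window.max() if mode == "max" else window.mean()
    return out

print(pool2x2(feature_map, "max"))   # keeps the strongest activation in each region
print(pool2x2(feature_map, "avg"))   # averages each region (dimensionality reduction)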
Deep learning is a form of machine learning that requires a neural network with a minimum of
three layers. Networks with multiple layers are more accurate than single-layer networks. Deep
learning applications often use CNNs or RNNs (recurrent neural networks).
The CNN architecture is especially useful for image recognition and image classification, as well
as other computer vision tasks because they can process large amounts of data and produce
highly accurate predictions. CNNs can learn the features of an object through multiple iterations,
eliminating the need for manual feature engineering tasks like feature extraction.
It is possible to retrain a CNN for a new recognition task or build a new model based on an
existing network with trained weights. This is known as transfer learning. This enables ML model
developers to apply CNNs to different use cases without starting from scratch.
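A minimal Keras sketch of the transfer-learning idea: load a network pretrained on ImageNet (VGG16 is used here purely as an example), freeze its trained weights, and attach a new classification head for the new task. The input size and the number of classes are illustrative assumptions.

from tensorflow import keras

# Pretrained convolutional base with its original classifier removed.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False   # reuse the trained weights instead of starting from scratch

# New classification head for the new recognition task (10 classes assumed).
model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])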
A Convolutional Neural Network (CNN) is a type of deep learning algorithm specifically designed
for image processing and recognition tasks. Compared to alternative classification models,
CNNs require less preprocessing as they can automatically learn hierarchical feature
representations from raw input images. They excel at assigning importance to various objects
and features within the images through convolutional layers, which apply filters to detect local
patterns.
The connectivity pattern in CNNs is inspired by the visual cortex in the human brain, where
neurons respond to specific regions or receptive fields in the visual space. This architecture
enables CNNs to effectively capture spatial relationships and patterns in images. By stacking
multiple convolutional and pooling layers, CNNs can learn increasingly complex features, leading
to high accuracy in tasks like image classification, object detection, and segmentation.
Convolutional neural networks are known for their superiority over other artificial neural networks,
given their ability to process visual, textual, and audio data. The CNN architecture comprises
three main layers: convolutional layers, pooling layers, and a fully connected (FC) layer.
There can be multiple convolutional and pooling layers. The more layers in the network, the
greater the complexity and (theoretically) the accuracy of the machine learning model. Each
additional layer that processes the input data increases the model’s ability to recognize objects
and patterns in the data.
Convolutional layers are the key building blocks of the network, where most of the computation is carried out. A convolutional layer works by applying a filter to the input data to identify features. This filter, known as a feature detector, checks the image input's receptive fields for a given feature. This operation is referred to as convolution.
The filter is a two-dimensional array of weights that represents part of a 2-dimensional image. A
filter is typically a 3×3 matrix, although there are other possible sizes. The filter is applied to a
region within the input image and calculates a dot product between the pixels, which is fed to an
output array. The filter then shifts and repeats the process until it has covered the whole image.
The final output of all the filter processes is called the feature map.
The CNN typically applies the ReLU (Rectified Linear Unit) transformation to each feature map
after every convolution to introduce nonlinearity to the ML model. A convolutional layer is
typically followed by a pooling layer. Together, the convolutional and pooling layers make up a
convolutional block.
Additional convolution blocks will follow the first block, creating a hierarchical structure with later
layers learning from the earlier layers. For example, a CNN model might train to detect cars in
images. Cars can be viewed as the sum of their parts, including the wheels, boot, and windscreen.
Each feature of a car equates to a low-level pattern identified by the neural network, which then
combines these parts to create a high-level pattern.
A pooling or down sampling layer reduces the dimensionality of the input. Like a convolutional
operation, pooling operations use a filter to sweep the whole input image, but it doesn’t use
weights. The filter instead uses an aggregation function to populate the output array based on
the receptive field’s values.
Average pooling: The filter calculates the receptive field’s average value when it scans
the input.
Max pooling: The filter sends the pixel with the maximum value to populate the output array.
The FC layer performs classification tasks using the features that the previous layers and filters
extracted. Instead of ReLu functions, the FC layer typically uses a softmax function that
classifies inputs more appropriately and produces a probability score between 0 and 1.
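Putting the layer types together, here is a minimal Keras sketch of a small CNN with convolutional blocks (convolution + ReLU + max pooling), a dropout layer, and a fully connected softmax classifier. The input shape and the number of classes are assumed for illustration.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # convolutional layer
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                        # flatten the feature maps for the FC layers
    layers.Dense(128, activation="relu"),    # fully connected layer
    layers.Dropout(0.5),                     # dropout to reduce overfitting
    layers.Dense(10, activation="softmax"),  # softmax output: class probabilities in [0, 1]
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()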
(OR)
Basic Architecture
There are three types of layers that make up the CNN which are the
convolutional layers, pooling layers, and fully-connected (FC) layers.
When these layers are stacked, a CNN architecture will be formed. In
addition to these three layers, there are two more important parameters
which are the dropout layer and the activation function which are defined
below.
1. Convolutional Layer
This layer is the first layer that is used to extract the various features
from the input images. In this layer, the mathematical operation of
convolution is performed between the input image and a filter of a
particular size MxM. By sliding the filter over the input image, the dot
product is taken between the filter and the parts of the input image with
respect to the size of the filter (MxM).
The convolution layer in a CNN passes the result to the next layer after applying the convolution operation to the input. Convolutional layers benefit CNNs greatly because they ensure that the spatial relationship between the pixels stays intact.
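A minimal NumPy sketch of this sliding-filter operation with stride 1 and no padding; the 3×3 filter values are illustrative, not a learned feature detector.

import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i:i + kh, j:j + kw]     # part of the image under the filter
            out[i, j] = np.sum(region * kernel)    # dot product between filter and image patch
    return out                                     # the resulting feature map

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 grayscale image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)       # illustrative vertical-edge filter
print(convolve2d(image, kernel))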
2. Pooling Layer
In Max Pooling, the largest element is taken from the feature map. Average Pooling calculates the average of the elements in a predefined-size image section. The total sum of the elements in the predefined section is computed in Sum Pooling. The pooling layer usually serves as a bridge between the convolutional layer and the FC layer.
3. Fully Connected Layer
The Fully Connected (FC) layer consists of the weights and biases along with the neurons and is used to connect the neurons between two different layers. These layers are usually placed before the output layer and form the last few layers of a CNN architecture.
In this layer, the output of the previous layers is flattened and fed to the FC layer. The flattened vector then passes through a few more FC layers, where the mathematical operations usually take place. At this stage, the classification process begins. The reason two FC layers are used is that two fully connected layers perform better than a single connected layer. These layers in a CNN reduce the need for human supervision.
4. Dropout
Usually, when all the features are connected to the FC layer, it can cause overfitting on the training dataset. Overfitting occurs when a model works so well on the training data that it has a negative impact on the model's performance when used on new data. To overcome this problem, a dropout layer is used, in which a few neurons are randomly dropped from the neural network during training, resulting in a smaller, regularised model.
5. Activation Functions
Finally, one of the most important parameters of the CNN model is the
activation function. They are used to learn and approximate any kind of
continuous and complex relationship between variables of the network. In
simple words, it decides which information of the model should fire in the
forward direction and which ones should not at the end of the network.
Activation functions:
Linear Activation Function
The linear (identity) activation function doesn't do anything to the weighted sum of the input; it simply returns the value it was given.
All layers of the neural network will collapse into one if a linear activation
function is used. No matter the number of layers in the neural network,
the last layer will still be a linear function of the first layer. So, essentially,
a linear activation function turns the neural network into just one layer.
Because of its limited power, this does not allow the model to create complex
mappings between the network’s inputs and outputs.
Now, let's have a look at the most commonly used non-linear activation functions and their characteristics.
Sigmoid / Logistic Activation Function
The sigmoid function is defined as f(x) = 1 / (1 + e^(−x)). The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.
Here’s why sigmoid/logistic activation function is one of the most widely used
functions:
It is commonly used for models where we have to predict the probability
as an output. Since probability of anything exists only between the range
of 0 and 1, sigmoid is the right choice because of its range.
The function is differentiable and provides a smooth gradient, i.e.,
preventing jumps in output values. This is represented by an S-shape of
the sigmoid activation function.
The gradient of the sigmoid is only significant for inputs roughly in the range -3 to 3; outside this range the curve becomes much flatter.
It implies that for values greater than 3 or less than -3, the function will have
very small gradients. As the gradient value approaches zero, the network
ceases to learn and suffers from the Vanishing gradient problem.
The output of the logistic function is not symmetric around zero. So the
output of all the neurons will be of the same sign. This makes
the training of the neural network more difficult and unstable.
Tanh Function (Hyperbolic Tangent)
The tanh function is similar to the sigmoid but is symmetric around zero, squashing its input into the range -1 to 1: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
Have a look at the gradient of the tanh activation function to understand its
limitations.
As you can see— it also faces the problem of vanishing gradients similar to the
sigmoid activation function. Plus the gradient of the tanh function is much
steeper as compared to the sigmoid function.
💡 Note: Although both sigmoid and tanh face vanishing gradient issue, tanh
is zero centered, and the gradients are not restricted to move in a certain
direction. Therefore, in practice, tanh nonlinearity is always preferred to
sigmoid nonlinearity.
ReLU Function
ReLU stands for Rectified Linear Unit and is defined as f(x) = max(0, x).
The main catch here is that the ReLU function does not activate all the neurons
at the same time.
The neurons will only be deactivated if the output of the linear transformation is
less than 0.
The negative side of the graph makes the gradient value zero. Due to this
reason, during the backpropagation process, the weights and biases for some
neurons are not updated. This can create dead neurons which never get
activated.
All the negative input values become zero immediately, which decreases
the model’s ability to fit or train from the data properly.
Note: For building the most reliable ML models, split your data into train,
validation, and test sets.
Leaky ReLU Function
Leaky ReLU is an improved version of the ReLU function that addresses the dying ReLU problem, since it has a small positive slope in the negative region, e.g. f(x) = max(0.01·x, x).
The advantages of Leaky ReLU are same as that of ReLU, in addition to the fact
that it does enable backpropagation, even for negative input values.
By making this minor modification for negative input values, the gradient of the
left side of the graph comes out to be a non-zero value. Therefore, we would no
longer encounter dead neurons in that region.
Parametric ReLU
The parameterized ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer. This function supplies the slope of the negative part of the function as a learnable argument a, i.e. f(x) = max(a·x, x); the most appropriate value of a is learnt during backpropagation.
1. Convolutional layer: These layers generate a feature map by sliding a filter over the
input image and recognizing patterns in images.
2. Pooling layers: These layers downsample the feature map to introduce Translation
invariance, which reduces the overfitting of the CNN model.
3. Fully Connected Dense Layer: This layer contains the same number of units as the number of classes, together with an output activation function such as “softmax” or “sigmoid”.
Pooling layers are one of the building blocks of Convolutional Neural Networks.
Where Convolutional layers extract features from images, Pooling
layers consolidate the features learned by CNNs. Its purpose is to gradually shrink
the representation’s spatial dimension to minimize the number of parameters and
computations in the network.
1. Max pooling: This works by selecting the maximum value from every pool. Max Pooling
retains the most prominent features of the feature map, and the returned image is
sharper than the original image.
2. Average pooling: This pooling layer works by getting the average of the pool. Average
pooling retains the average values of features of the feature map. It smoothes the image
while keeping the essence of the feature in an image.
Let’s explore the working of Pooling Layers using TensorFlow. Create a NumPy array
and reshape it.
Max Pooling
Create a MaxPool2D layer with pool_size=2 and strides=2. Apply the MaxPool2D
layer to the matrix, and you will get the MaxPooled output in the tensor form. By
applying it to the matrix, the Max pooling layer will go through the matrix by
computing the max of each 2×2 pool with a jump of 2. Print the shape of the tensor.
Use tf.squeeze to remove dimensions of size 1 from the shape of a tensor.
Average Pooling
Create an AveragePooling2D layer with the same pool_size=2 and strides=2 and apply it to the matrix: the average pooling layer will go through the matrix, computing the average of each 2×2 pool with a jump of 2. Print the shape of the output and use tf.squeeze to convert it into a readable form by removing all size-1 dimensions.
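A minimal sketch of these steps (the 4×4 matrix and its values are illustrative assumptions):

import numpy as np
import tensorflow as tf

# A 4x4 matrix reshaped to (batch, height, width, channels)
matrix = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)

# Max pooling: the maximum of each 2x2 pool, with a jump (stride) of 2
max_out = tf.keras.layers.MaxPool2D(pool_size=2, strides=2)(matrix)
print(max_out.shape)                # (1, 2, 2, 1)
print(tf.squeeze(max_out).numpy())  # drop the size-1 dimensions for readability

# Average pooling: the average of each 2x2 pool, with a jump of 2
avg_out = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)(matrix)
print(tf.squeeze(avg_out).numpy())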
These pooling layers slide over the input matrix, computing the maximum of each pool for max pooling and the average of each pool for average pooling.
Global Pooling Layers often replace the classifier’s fully connected or Flatten layer.
The model instead ends with a convolutional layer that produces as many feature
maps as there are target classes and performs global average pooling on each of
the feature maps to combine each feature map into a single value.
Create the same NumPy array, reshaped so that it has a channel dimension; whatever the spatial size of the input, the Global Pooling layers will reduce each feature map to one value.
Considering a tensor of shape h*w*n, the output of the Global Average Pooling layer
is a single value across h*w that summarizes the presence of the feature. Instead of
downsizing the patches of the input feature map, the Global Average Pooling layer
downsizes the whole h*w into 1 value by taking the average.
With the tensor of shape h*w*n, the output of the Global Max Pooling layer is a
single value across h*w that summarizes the presence of a feature. Instead of
downsizing the patches of the input feature map, the Global Max Pooling layer
downsizes the whole h*w into 1 value by taking the maximum.
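A minimal sketch of the two global pooling layers (the shape h = w = 4 with n = 2 feature maps is an illustrative assumption):

import numpy as np
import tensorflow as tf

# One batch of feature maps with shape (batch, h, w, n)
feature_maps = np.arange(32, dtype=np.float32).reshape(1, 4, 4, 2)

# Global average pooling collapses each h*w feature map into its average -> shape (1, 2)
print(tf.keras.layers.GlobalAveragePooling2D()(feature_maps).numpy())

# Global max pooling collapses each h*w feature map into its maximum -> shape (1, 2)
print(tf.keras.layers.GlobalMaxPooling2D()(feature_maps).numpy())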
If we are familiar with the building blocks of ConvNets, we are ready to build one with TensorFlow. We use the MNIST dataset for image classification. Preparing the data is the same as in the previous tutorial, so we can run the code and jump directly into the architecture of the CNN. Here, we execute our code in Google Colab (an online machine learning editor).
These are the steps used to train the CNN (Convolutional Neural Network).
Steps:
import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_openml

# fetch_mldata has been removed from scikit-learn; fetch_openml downloads the same MNIST data
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)
CNN uses filters on the pixels of an image to learn detailed patterns, compared to the global patterns learned by a traditional neural network. To create a CNN, we have to define:
1. A convolutional layer: applies a number of filters to the feature map. After the convolution, we use a ReLU activation function to add non-linearity to the network.
2. Pooling layer: the next step after the convolution is to downsample the feature maps. The objective is to reduce the dimensionality of the feature maps to prevent overfitting and improve the computation speed. Max pooling is the usual technique; it splits the feature maps into sub-regions and keeps only the maximum values.
3. Fully connected layers: all neurons from the previous layers are connected to the next layers. The CNN classifies the label according to the features extracted by the convolutional layers and downsampled by the pooling layers.
CNN Architecture
o Convolutional Layer: It applies 14 5x5 filters (extracting 5x5-pixel sub-regions),
o Pooling Layer: This will perform max pooling with a 2x2 filter and stride of 2 (so that the pooled regions do not overlap).
These building blocks are created with the following TensorFlow functions:
1. Conv2d (). Construct a two-dimensional convolutional layer with the number of filters,
filter kernel size, padding, and activation function like arguments.
2. max_pooling2d (). Construct a two-dimensional pooling layer using the max-pooling
algorithm.
3. Dense(). Construct a dense layer with its number of hidden units.
Let's see in detail how to construct every building block before wrapping everything
in the function.
We need to define a tensor with the shape of the data. For that, we can use
the module tf.reshape. In this module, we need to declare the tensor to reshape and
to shape the tensor. The first argument is the feature of the data, that is defined in
the argument of a function.
A picture has a width, a height, and a channel. The MNIST dataset is a monochromic
picture of size 28x28. We set the batch size to -1 in the shape argument so that the batch dimension is inferred from features["x"]. The advantage is that the batch size becomes a tunable hyperparameter. If the batch size is 7, the tensor will feed 5,488 values (28 * 28 * 7).
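A minimal sketch of this reshape (the features dict below is an illustrative stand-in for the estimator's input):

import tensorflow as tf

# Stand-in for the model function's features argument: a batch of 7 flattened images
features = {"x": tf.zeros([7, 784])}

# -1 lets TensorFlow infer the batch size; each image becomes 28x28 with 1 channel
input_layer = tf.reshape(features["x"], [-1, 28, 28, 1])
print(input_layer.shape)  # (7, 28, 28, 1)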
The first convolutional layer has 14 filters with a kernel size of 5x5 and "same" padding (matching the architecture listed above). With same padding, the output tensor and input tensor have the same width and height; TensorFlow adds zeros to the rows and columns to ensure the same size. We use the ReLU activation function. The output size will be [28, 28, 14].
The next step after the convolution is the pooling computation, which will reduce the dimensionality of the data. We can use the module max_pooling2d with a pool size of 2x2 and a stride of 2, taking the previous layer as input. The output size will be [batch_size, 14, 14, 14].
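A sketch of this first convolution and pooling step, written in the same TF 1.x tf.layers style as the conv2/pool2 code below (under TF 2.x the equivalent calls live in tf.compat.v1.layers or tf.keras.layers):

# 14 filters of size 5x5; same padding keeps the 28x28 spatial size
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=14,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
# 2x2 max pooling with stride 2 halves the spatial size -> [batch_size, 14, 14, 14]
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)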
The second convolutional layer has 36 filters, with an output size of [batch_size, 14, 14, 36]. The pooling layer that follows uses the same 2x2 pool, and its output shape is [batch_size, 7, 7, 36].
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=36,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)
We have to define the fully-connected layer. The feature map has to be flattened before it can be combined with the dense layer, so we reshape it to a vector of size 7*7*36. The dense layer will connect 1,764 neurons with a ReLU activation function. We then add a dropout regularization term with a rate of 0.3, meaning 30 percent of the units will be set to 0 during training. The dropout takes place only during the training phase; the cnn_model_fn() has a mode argument to declare whether the model needs to be trained or evaluated.
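A minimal sketch of the flatten, dense, and dropout steps under the same TF 1.x assumptions (mode is supplied by the estimator framework):

# Flatten the 7x7x36 feature maps into a vector
pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 36])
dense = tf.layers.dense(inputs=pool2_flat, units=1764, activation=tf.nn.relu)
# 30 percent of the units are dropped, but only during training
dropout = tf.layers.dropout(
    inputs=dense, rate=0.3,
    training=(mode == tf.estimator.ModeKeys.TRAIN))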
Finally, we define the last layer with the predictions of the model. The logits layer has one unit per class; for the ten MNIST digits, the output shape is [batch_size, 10].

# Logits layer: one unit per class (10 digits)
logits = tf.layers.dense(inputs=dropout, units=10)
We can create a dictionary that contains the predicted classes and the probability of each class. tf.argmax() returns the index of the highest value in the logits, while the softmax function returns the probability of every class.
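A sketch of that dictionary under the same TF 1.x conventions:

predictions = {
    "classes": tf.argmax(input=logits, axis=1),                    # the most likely digit
    "probabilities": tf.nn.softmax(logits, name="softmax_tensor")  # per-class probabilities
}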
VGG
VGG (Visual Geometry Group) is a research group within the Department of Engineering Science
at the University of Oxford. The VGG group is well-known for its work in computer vision,
particularly in the area of convolutional neural networks (CNNs).
One of the most famous contributions from the VGG group is the VGG model, also known as
VGGNet. The VGG model is a deep neural network that achieved state-of-the-art performance on
the ImageNet Large Scale Visual Recognition Challenge in 2014, and has been widely used as a
benchmark for image classification and object detection tasks.
The VGG model is characterized by its use of small convolutional filters (3×3) and deep
architecture (up to 19 layers), which enables it to learn increasingly complex features from input
images. The VGG model also uses max pooling layers to reduce the spatial resolution of the
feature maps and increase the receptive field, which can improve its ability to recognize objects
of varying scales and orientations.
The VGG model has inspired many subsequent research efforts in deep learning, including the
development of even deeper neural networks and the use of residual connections to improve
gradient flow and training stability.
ResNet
ResNet (short for “Residual Neural Network”) is a family of deep convolutional neural networks
designed to overcome the problem of vanishing gradients that are common in very deep
networks. The idea behind ResNet is to use “residual blocks” that allow for the direct propagation
of gradients through the network, enabling the training of very deep networks.
A residual block consists of two or more convolutional layers followed by an activation function,
combined with a shortcut connection that bypasses the convolutional layers and adds the
original input directly to the output of the convolutional layers after the activation function.
This allows the network to learn residual functions that represent the difference between the
convolutional layers’ input and output, rather than trying to learn the entire mapping directly. The
use of residual blocks enables the training of very deep networks, with hundreds or thousands of
layers, significantly alleviating the issue of vanishing gradients.
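A minimal Keras sketch of one residual block as described above (the filter count and kernel size are illustrative assumptions, and the input is assumed to already have the same number of channels):

import tensorflow as tf

def residual_block(x, filters=64):
    # Two convolutional layers plus a shortcut that adds the input back to their output
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])        # the shortcut (skip) connection
    return tf.keras.layers.Activation("relu")(y)    # activation applied after the addition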
GoogLeNet
GoogLeNet is notable for its use of the Inception module, which consists of multiple parallel
convolutional layers with different filter sizes, followed by a pooling layer, and concatenation of
the outputs. This design allows the network to learn features at multiple scales and resolutions,
while keeping the computational cost manageable. The network also includes auxiliary
classifiers at intermediate layers, which encourage the network to learn more discriminative
features and prevent overfitting.
GoogLeNet builds upon the ideas of previous convolutional neural networks, including LeNet,
which was one of the first successful applications of deep learning in computer vision. However,
GoogLeNet is much deeper and more complex than LeNet.
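A simplified Keras sketch of an Inception-style module with parallel convolutions of different filter sizes plus a pooling branch (the filter counts are illustrative assumptions; the real GoogLeNet modules also use 1x1 reduction convolutions before the larger filters):

import tensorflow as tf

def inception_module(x, f1=64, f3=128, f5=32, fp=32):
    # Parallel branches over the same input
    b1 = tf.keras.layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = tf.keras.layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = tf.keras.layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = tf.keras.layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = tf.keras.layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    # Concatenate the branch outputs along the channel axis
    return tf.keras.layers.Concatenate()([b1, b3, b5, bp])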
Dropout:
What is a Dropout?
The term “dropout” refers to dropping out nodes (in the input and hidden layers) of a neural network. All the forward and backward connections of a dropped node are temporarily removed, thus creating a new network architecture out of the parent network. The nodes are dropped with a dropout probability of p.
For instance, if the hidden layers have 1000 neurons (nodes) and a
dropout is applied with drop probability = 0.5, then 500 neurons
would be randomly dropped in every iteration (batch).
Generally, for the input layers the keep probability (i.e. 1 − drop probability) is closer to 1, with 0.8 suggested as best by the authors. For the hidden layers, the greater the drop probability, the sparser the model; a drop probability of 0.5 is considered the most optimized, meaning 50% of the nodes are dropped.
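A minimal Keras sketch of dropout with these probabilities (the layer sizes are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # each of the 1000 hidden units is dropped with probability 0.5 during training
    tf.keras.layers.Dense(10, activation="softmax"),
])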
Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer
features with dropout
From figure 2, we can easily make out that the hidden layer with
dropout is learning more of the generalised features than the co-
adaptations in the layer without dropout. It is quite apparent, that
dropout breaks such inter-unit relations and focuses more on
generalisation.
Dropout Implementation
Enough of the talking! Let’s head to the mathematical explanation of
the dropout.
In a standard network, the output of layer (l + 1) is computed as
z(l+1) = w(l+1)·y(l) + b(l+1),  y(l+1) = f(z(l+1)).
With dropout, a binary mask r(l) with entries drawn from Bernoulli(p) is applied to the outputs of layer l before they feed the next layer:
ŷ(l) = r(l) * y(l),  z(l+1) = w(l+1)·ŷ(l) + b(l+1),  y(l+1) = f(z(l+1))
where:
z denotes the vector of inputs to layer (l + 1) before activation,
y denotes the vector of outputs from layer l,
w and b are the weights and bias connecting layer l to layer (l + 1), and
f is the activation function.
Normalization:
Normalization is a pre-processing technique used to standardize data; in other words, it brings data from different sources into the same range. Not normalizing the data before training can cause problems in our network, making it drastically harder to train and decreasing its learning speed.
Without normalization, features with larger numeric ranges can dominate training and take on an inflated importance.
There are two main methods to normalize our data. The most straightforward method is to scale it to a fixed range, for example using
x' = (x − μ) / (x_max − x_min)
where x is the data point to normalize, μ the mean of the data set, x_max the highest value, and x_min the lowest value. This technique is generally used on the inputs of the data. Non-normalized data points with wide ranges can cause instability in neural networks: relatively large inputs can cascade down through the layers, causing problems such as exploding gradients.
The other technique used to normalize data is to force the data points to have a mean of 0 and a standard deviation of 1, using the following formula:
x' = (x − μ) / σ
where x is the data point to normalize, μ the mean of the data set, and σ the standard deviation of the data set. Now, each data point mimics a standard normal distribution. With all the features on this scale, none of them will have a bias, and therefore our models will learn better.
Batch Normalization
Batch Norm is a normalization technique done between the layers of a
Neural Network instead of in the raw data. It is done along mini-batches
instead of the full data set. It serves to speed up training and use higher
learning rates, making learning easier.
Following the technique explained in the previous section, we can define the normalization formula of Batch Norm as:
z_N = (z − m_z) / s_z
where m_z is the mean of the neurons' output and s_z the standard deviation of the neurons' output.
How Is It Applied?
In a regular feed-forward neural network we have the inputs x, the outputs z of the neurons, the outputs a of the activation functions, and the output ŷ of the network.
Batch Norm is applied to the neurons' output just before applying the activation function. Usually, a neuron without Batch Norm is computed as follows:
z = g(x, w) + b;  a = f(z)
where g() is the linear transformation of the neuron, w the weights of the neuron, b the bias of the neuron, and f() the activation function. The model learns the parameters w and b. Adding Batch Norm, it looks like this:
z = g(x, w);  z_N = ((z − m_z) / s_z) · γ + β;  a = f(z_N)
where z_N is the output of Batch Norm, m_z the mean of the neurons' output, s_z the standard deviation of the neurons' output, and γ and β the learning parameters of Batch Norm. Note that the bias of the neuron (b) is removed: since we subtract the mean m_z, any constant over the values of z, such as b, can be ignored as it will be subtracted by itself.
The parameters γ and β shift the mean and standard deviation, respectively. Thus, the outputs of Batch Norm over a layer result in a distribution with a mean of β and a standard deviation of γ. These values are learned over the epochs, together with the other learning parameters such as the weights of the neurons, with the aim of decreasing the loss of the model.
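A minimal Keras sketch of Batch Norm placed between a layer's linear transformation and its activation, as described above (the layer sizes are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, use_bias=False),  # the bias can be dropped: Batch Norm's beta takes its place
    tf.keras.layers.BatchNormalization(),        # learns gamma and beta, normalizing over each mini-batch
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])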
Data Augmentation:
Data Augmentation provides many possibilities to alter the original image and can
be useful to add enough data for larger models. It is important to learn the
techniques of Data Augmentation and its advantages and disadvantages. In this
post, I’ll cover all the details you need and show you a Python example using
PyTorch.
Different methods multiply the existing training data by different factors; rotation, translation, and scaling can each multiply it by an arbitrary factor. Common Data Augmentation methods include:
Flips
Rotation (at 90 degrees and finer angles)
Translation
Scaling
Salt and Pepper noise addition
Data Augmentation has even been used in applications like sound recognition. In the
next sections, I’ll cover these Data Augmentation methods in detail.
Flips:
By flipping images, the optimizer will not become biased towards particular features of an image appearing on only one side. To do this augmentation, the
original training image is flipped vertically or horizontally over one axis of
the image. As a result, the features continually change directions.
Figure: Stella the Puppy on a car seat (left) and flipped over the vertical axis (right).
Rotation:
Figure: Stella the Puppy (left) and rotated 90 degrees (right).
Translation:
Figure: Stella the Puppy (left) and translated and cropped so she is only partly visible (right).
Scaling:
Scaling provides more diversity in the training data of a machine learning model.
Scaling the image will ensure that the object is recognized by the network regardless
of how zoomed in or out the image is. Sometimes the object is tiny in the center.
Sometimes, the object is zoomed in the image and even cropped at some parts.
Figure: Stella the Puppy (left) and scaled up to appear even larger than she is in real life (right).
Salt and Pepper Noise:
Salt and pepper noise addition is the addition of black and white dots (looking like salt and pepper) to the image. This simulates dust and imperfections in real photos.
Even if the camera of the photographer is blurry or has spots on it, the image would
be better recognized by the model. The training data set is augmented to train the
model with more realistic images.
Figure: Stella the Puppy (left) and with salt and pepper noise added to the image (right).
Data Augmentation is not useful when the variety required by the application cannot be artificially generated. For example, suppose one were training a bird recognition model and the training data contained only red birds; the training data could be augmented by generating pictures with the color of the bird varied.
However, the artificial augmentation method may not capture the realistic color
details of birds when there is not enough variety of data to start with. For example, if
the augmentation method simply varied red for blue or green, etc. Realistic non-red
birds may have more complex color variations and the model may fail to recognize
the color. Having sufficient data is still important if one wants Data Augmentation to
work properly.
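The PyTorch example promised above, as a minimal sketch using torchvision (the rotation angle, translation and scaling ranges, and the file name are illustrative assumptions):

from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # flips
    transforms.RandomRotation(degrees=90),             # rotation
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),      # translation
                            scale=(0.8, 1.2)),         # scaling
    transforms.ToTensor(),
])

image = Image.open("stella.jpg")   # hypothetical image file
augmented = augment(image)         # a new, randomly transformed training example on every call
print(augmented.shape)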
UNIT-III
RECURRENT NEURAL NETWORK (RNN): Introduction to
RNNs and their applications in sequential data analysis, Back
propagation through time (BPTT), Vanishing Gradient
Problem, gradient clipping Long Short-Term Memory (LSTM)
Networks, Gated Recurrent Units, Bidirectional LSTMs,
Bidirectional RNNs.
A Recurrent Neural Network (RNN) works better than a simple neural network when the data is sequential, like time-series data and text data.
A Deep Learning approach for modelling sequential data is Recurrent
Neural Networks (RNN). RNNs were the standard suggestion for working
with sequential data before the advent of attention models. Specific
parameters for each element of the sequence may be required by a deep
feedforward model. It may also be unable to generalize to variable-length
sequences.
Recurrent Neural Networks use the same weights for each element of the
sequence, decreasing the number of parameters and allowing the model to
generalize to sequences of varying lengths. Because of this design, RNNs also generalize to structured data other than sequential data, such as geographical or graphical data.
Recurrent neural networks, like many other deep learning techniques, are
relatively old. They were first developed in the 1980s, but we didn’t
appreciate their full potential until lately. The advent of long short-term
memory (LSTM) in the 1990s, combined with an increase in computational
power and the vast amounts of data that we now have to deal with, has
really pushed RNNs to the forefront.
Neural networks imitate the function of the human brain in the fields of AI,
machine learning, and deep learning, allowing computer programs to
recognize patterns and solve common issues.
RNNs are a type of neural network that can be used to model sequence
data. RNNs, which are formed from feedforward networks, are similar to
human brains in their behaviour. Simply said, recurrent neural networks can
anticipate sequential data in a way that other algorithms can’t.
All of the inputs and outputs in standard neural networks are independent
of one another, however in some circumstances, such as when predicting
the next word of a phrase, the prior words are necessary, and so the
previous words must be remembered. As a result, RNN was created, which
used a Hidden Layer to overcome the problem. The most important
component of RNN is the Hidden state, which remembers specific
information about a sequence.
RNNs have a Memory that stores all information about the calculations. It
employs the same settings for each input since it produces the same
outcome by performing the same task on all inputs or hidden layers.
RNNs are a type of neural network that has hidden states and allows past
outputs to be used as inputs. They usually go like this:
RNN architecture can vary depending on the problem you’re trying to solve.
From those with a single input and output to those with many (with
variations between).
Below are some examples of RNN architectures that can help you better
understand this.
The input layer x receives and processes the neural network’s input before
passing it on to the middle layer.
Multiple hidden layers can be found in the middle layer h, each with its own activation functions, weights, and biases. In a standard feed-forward network, the different hidden layers have independent parameters and the network has no memory of previous inputs; a recurrent neural network instead shares the same parameters across time steps and feeds the hidden state back into the network, giving it memory.
Applications of RNNs:
1. Machine Translation:
RNN can be used to build a deep learning model that can translate
text from one language to another without the need for human
intervention. You can, for example, translate a text from your native
language to English.
2. Text Creation:
RNNs can also be used to build a deep learning model for text
generation. Based on the previous sequence of words/characters
used in the text, a trained model learns the likelihood of occurrence of
a word/character. A model can be trained at the character, n-gram,
sentence, or paragraph level.
3. Captioning of images:
RNNs, often combined with CNNs, can be used to generate a textual caption describing the content of an image.
4. Recognition of Speech:
RNNs can map a sequence of audio features to a sequence of words, converting spoken language into text.
5. Prediction of Stock Prices:
You can use stock market data to build a machine learning model that
can forecast future stock prices based on what the model learns from
historical data. This can assist investors in making data-driven
investment decisions.
Unlike an RNN, a feed-forward neural network has no idea of temporal order; apart from its training, it has no memory of what transpired in the past.
Backpropagation Through Time (BPTT):
To calculate the error, we take the output and calculate its error concerning
the actual result, but we have multiple outputs at each timestamp. Thus, the
regular Backpropagation won't work here. Therefore, we modify this algorithm
and call the new algorithm as Backpropagation through time.
Now the question arises: What is the total loss for this network? How do we
update the weights Ws, Wx, and Wy?
But when calculating the derivative of the loss with respect to Ws and Wx, it becomes tricky: the hidden state at each timestamp depends on the hidden states at all previous timestamps, so the chain rule has to accumulate contributions across every timestamp.
Now that we have calculated all three derivatives, we can easily update the
weights. This algorithm is known as Backpropagation through time, as we
used values across all the timestamps to calculate the gradients.
The algorithm at a glance:
We feed a sequence of timestamps of input and output pairs to the
network.
Then, we unroll the network and calculate and accumulate the errors across each timestamp.
Finally, we roll up the network and update weights.
Repeat the process.
Limitations of BPTT:
BPTT has difficulty with local optima. Local optima are a more significant
issue with recurrent neural networks than feed-forward neural networks. The
recurrent feedback in such networks creates chaotic responses in the error
surface, which causes local optima to occur frequently and in the wrong
locations on the error surface.
When using BPTT in RNN, we face problems such as exploding gradient and
vanishing gradient. To avoid issues such as exploding gradient, we use a
gradient clipping method to check if the gradient value is greater than the
threshold or not at each timestamp. If it is, we normalize it. This helps to
tackle exploding gradient.
We can use BPTT up to a limited number of steps, like 8 or 10. If we backpropagate further, the gradient becomes too small to matter; this is the vanishing gradient problem. To avoid the vanishing gradient problem, some of the possible solutions are:
Using ReLU activation function in place of tanh or sigmoid
activation function.
Properly initializing the weight matrix can reduce the effect of vanishing gradients. For example, initializing with an identity matrix helps tackle this problem.
Using gated cells such as LSTM or GRUs.
Backpropagation works quite similarly for RNNs, but here we have a little bit more going on. First, the network is unrolled through time; second, you can calculate the cost function, or error, at each time point. Basically, during training, the cost function at each timestamp compares your outcome to your desired output.
As a result, you have these error values throughout the time series, for every single timestamp.
You’ve calculated the cost function et, and now you want to propagate your
cost function back through the network because you need to update the
weights.
The problem relates to updating wrec (weight recurring) – the weight that is
used to connect the hidden layers to themselves in the unrolled temporal loop.
For instance, to get from xt-3 to xt-2 we multiply xt-3 by wrec. Then, to get
from xt-2 to xt-1 we again multiply xt-2 by wrec. So, we multiply with the same
exact weight multiple times, and this is where the problem arises: when you
multiply something by a small number, your value decreases very quickly.
As we know, weights are assigned at the start of the neural network with the
random values, which are close to zero, and from there the network trains
them up. But, when you start with wrec close to zero and multiply xt, xt-1, xt-2,
xt-3, … by this value, your gradient becomes less and less with each
multiplication.
The lower the gradient is, the harder it is for the network to update the weights
and the longer it takes to get to the final result.
For instance, 1000 epochs might be enough to get the final weight for the time
point t, but insufficient for training the weights for the time point t-3 due to a
very low gradient at this point. However, the problem is not only that half of
the network is not trained properly.
The output of the earlier layers is used as the input for the further layers. Thus,
the training for the time point t is happening all along based on inputs that are
coming from untrained layers. So, because of the vanishing gradient, the
whole network is not being trained properly.
To sum up, if wrec is small you have the vanishing gradient problem, and if wrec is large you have the exploding gradient problem.
For the vanishing gradient problem, the further you go through the network,
the lower your gradient is and the harder it is to train the weights, which has a
domino effect on all of the further weights throughout the network.
That was the main roadblock to using Recurrent Neural Networks. But let’s
now check what are the possible solutions to this problem.
Possible remedies include careful weight initialization, using the ReLU activation function, Echo State Networks (which are designed to solve the vanishing gradient problem), and gated cells such as LSTMs and GRUs.
Gradient Clipping:
Training a neural network with gradient descent requires first the estimation of the loss on one or more training examples,
then the calculation of the derivative of the loss, which is propagated backward
through the network in order to update the weights. Weights are updated using a
fraction of the back propagated error controlled by the learning rate.
It is possible for the updates to the weights to be so large that the weights
either overflow or underflow their numerical precision. In practice, the weights can
take on the value of an “NaN” or “Inf” when they overflow or underflow and for
practical purposes the network will be useless from that point forward, forever
predicting NaN values as signals flow through the invalid weights.
“ The difficulty that arises is that when the parameter gradient is very large, a
gradient descent parameter update could throw the parameters very far, into a region
where the objective function is larger, undoing much of the work that had been done
to reach the current solution.”
“One difficulty when training LSTM with the full gradient is that the derivatives
sometimes become excessively large, leading to numerical problems. To prevent
this, [we] clipped the derivative of the loss with respect to the network inputs to the
LSTM layers (before the sigmoid and tanh functions are applied) to lie within a
predefined range”
There are two main methods for updating the error derivative; they are:
Gradient Scaling.
Gradient Clipping.
Gradient scaling involves normalizing the error gradient vector such that vector norm
(magnitude) equals a defined value, such as 1.0.
… one simple mechanism to deal with a sudden increase in the norm of the gradients
is to rescale them whenever they go over a threshold
“When the traditional gradient descent algorithm proposes to make a very large step,
the gradient clipping heuristic intervenes to reduce the step size to be small enough
that it is less likely to go outside the region where the gradient indicates the direction
of approximately steepest descent.”
It is a method that only addresses the numerical stability of training deep neural
network models and does not offer any general improvement in performance.
The value for the gradient vector norm or preferred range can be configured by trial
and error, by using common values used in the literature or by first observing
common vector norms or ranges via experimentation and then choosing a sensible
value.
In our experiments we have noticed that for a given task and model size, training is
not very sensitive to this [gradient norm] hyperparameter and the algorithm behaves
well even for rather small thresholds.
It is common to use the same gradient clipping configuration for all layers in the
network. Nevertheless, there are examples where a larger range of error gradients
are permitted in the output layer compared to hidden layers.
The output derivatives […] were clipped in the range [−100, 100], and the LSTM
derivatives were clipped in the range [−10, 10]. Clipping the output gradients proved
vital for numerical stability; even so, the networks sometimes had numerical
problems late on in training, after they had started overfitting on the training data.
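A minimal Keras sketch of both approaches (the threshold values are illustrative; clipnorm rescales the whole gradient vector, while clipvalue clips each element of the gradient):

import tensorflow as tf

# Gradient scaling: rescale the gradient vector whenever its L2 norm exceeds 1.0
opt_scaled = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Gradient clipping: force every gradient element into the range [-0.5, 0.5]
opt_clipped = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)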
"My mom gave me a bicycle on my birthday because she knew that I wanted to go
biking with my friends."
As we can see from the above sentence, words that affect each other can be further
apart. For example, "bicycle" and "go biking" are closely related but are placed further
apart in the sentence.
An RNN finds it difficult to track the state over such a long context. It needs
to find out what information is important. However, a GRU cell greatly alleviates this
problem.
GRU network was invented in 2014. It solves problems involving long sequences with
contexts placed further apart, like the above biking example. This is possible
because of how the GRU cell in the GRU architecture is built. Let us now delve deeper
into the understanding and working of the GRU network.
The Gated Recurrent Unit (GRU) cell is the basic building block of a GRU network. It
comprises three main components: an update gate, a reset gate, and a candidate
hidden state.
One of the key advantages of the GRU cell is its simplicity. Since it has fewer
parameters than a long short-term memory (LSTM) cell, it is faster to train and run
and less prone to overfitting.
Additionally, one thing to remember is that although the GRU cell architecture is simple, the cell can be used as a black box: the final decision on how much of the past state should be kept and how much should be forgotten is taken inside the GRU cell. To understand what the cell is doing, we need to look inside it.
Here is a comparison of Gated Recurrent Unit (GRU) and Long Short-Term Memory
(LSTM) networks
Structure: GRU has a simpler structure with two gates (update and reset); LSTM has a more complex structure with three gates (input, forget, and output).
Parameters: GRU has fewer parameters (3 weight matrices); LSTM has more parameters (4 weight matrices).
Training: GRU is faster to train; LSTM is slower to train.
Space complexity: in most cases GRU tends to use fewer memory resources due to its simpler structure and fewer parameters, and is thus better suited for large datasets or sequences; LSTM has a more complex structure and a larger number of parameters, so it might require more memory resources and could be less effective for large datasets or sequences.
Performance: GRU generally performs similarly to LSTM on many tasks, but in some cases GRU has been shown to outperform LSTM and vice versa, so it is better to try both and see which works better for your dataset and task; LSTM generally performs well on many tasks but is more computationally expensive and requires more memory resources, and it has advantages over GRU in natural language understanding and machine translation tasks.
A GRU cell keeps track of the important information maintained throughout the
network. A GRU network achieves this with the following two gates:
Reset Gate
Update Gate.
Given below is the simplest architectural form of a GRU cell.
Update gate
An update gate determines what current GRU cell will pass information to the next
GRU cell. It helps in keeping track of the most important information.
Let us see how the output of the update gate is obtained in a GRU cell. The inputs to the update gate are the hidden state at the previous timestep h(t−1) and the current input x(t). Both have weights associated with them which are learned during the training process; say the weight associated with h(t−1) is U(z) and that of x(t) is W(z). The output of the update gate z(t) is given by:
z(t) = σ(W(z)·x(t) + U(z)·h(t−1))
Reset gate
A reset gate identifies the unnecessary information and decides what information to
be laid off from the GRU network. Simply put, it decides what information to delete at
the specific timestamp.
Let us see how the output of the reset gate is obtained in a GRU cell. The inputs to the reset gate are the hidden state at the previous timestep h(t−1) and the current input x(t). Both have weights associated with them which are learned during the training process; say the weight associated with h(t−1) is U(r) and that of x(t) is W(r). The output of the reset gate r(t) is given by:
r(t) = σ(W(r)·x(t) + U(r)·h(t−1))
PS: It is important to note that the weights associated with the hidden layer at the
previous timestep and the current input are different for both gates. The values for
these weights are learned during the training process.
Gated Recurrent Unit (GRU) networks process sequential data, such as time series or natural language, by passing the hidden state from one time step to the next. The
hidden state is a vector that captures the information from the past time steps
relevant to the current time step. The main idea behind a GRU is to allow the network
to decide what information from the last time step is relevant to the current time
step and what information can be discarded.
A candidate hidden state is calculated using the reset gate. It determines what information is stored from the past and is generally called the memory component of a GRU cell. It is calculated by:
h′(t) = tanh(W·x(t) + r(t) ⊙ U·h(t−1))
Here, W is the weight associated with the current input, r(t) the output of the reset gate, U the weight associated with the hidden state of the previous timestep, and h′(t) the candidate hidden state.
Hidden state
The new hidden state depends on the update gate and the candidate hidden state:
h(t) = z(t) ⊙ h(t−1) + (1 − z(t)) ⊙ h′(t)
Here, z(t) is the output of the update gate, h′(t) the candidate hidden state, and h(t−1) the hidden state at the previous timestep.
Now, we have all the basics to understand a GRU network's forward propagation (i.e.,
working). Without any further ado, let us get started.
In a Gated Recurrent Unit (GRU) cell, the forward propagation process includes several steps (see the sketch below):
1. Calculate the output of the update gate z(t) using the update gate formula.
2. Calculate the output of the reset gate r(t) using the reset gate formula.
3. Calculate the candidate hidden state h′(t) from the current input and the reset-gated previous hidden state.
4. Combine the previous hidden state and the candidate hidden state through the update gate to obtain the new hidden state h(t).
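A minimal NumPy sketch of these forward-propagation formulas (the weight matrices are assumed to be given; biases are omitted, as in the formulas above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_forward(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    # One forward step of a GRU cell, following the formulas above
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(W @ x_t + r_t * (U @ h_prev))    # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_cand        # new hidden state h(t)

# Illustrative usage: 3 input features, 4 hidden units, random weights
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=3), np.zeros(4)
Wz, Wr, W = (rng.normal(size=(4, 3)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(4, 4)) for _ in range(3))
print(gru_cell_forward(x_t, h_prev, Wz, Uz, Wr, Ur, W, U))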
Whenever the network predicts wrongly, the prediction is compared with the original label and the loss is propagated back throughout the network. This continues until the weights take values for which the loss function is at its minimum. During this time, the weights and biases associated with the hidden layers and the input are fine-tuned.
What are the differences and similarities between LSTM and GRU in
terms of architecture and performance?
LSTM and GRU are two types of recurrent neural networks (RNNs) that can
handle sequential data, such as text, speech, or video. They are designed to
overcome the problem of vanishing or exploding gradients that affect the
training of standard RNNs. However, they have different architectures and
performance characteristics that make them suitable for different
applications. In this article, you will learn about the differences and
similarities between LSTM and GRU in terms of architecture and performance.
LSTM Architecture
LSTM stands for long short-term memory, and it consists of a series of memory
cells that can store and update information over long time steps. Each
memory cell has three gates: an input gate, an output gate, and a forget gate.
The input gate decides what information to add to the cell state, the output
gate decides what information to output from the cell state, and the forget
gate decides what information to discard from the cell state. The gates are
learned by the network based on the input and the previous hidden state.
GRU Architecture
GRU stands for gated recurrent unit, and it is a simplified version of LSTM. It has
only two gates: a reset gate and an update gate. The reset gate decides how
much of the previous hidden state to keep, and the update gate decides how
much of the new input to incorporate into the hidden state. The hidden state
also acts as the cell state and the output, so there is no separate output gate.
The GRU is easier to implement and requires fewer parameters than the
LSTM.
Performance Comparison
The performance of LSTM and GRU depends on the task, the data, and the
hyperparameters. Generally, LSTM is more powerful and flexible than GRU,
but it is also more complex and prone to overfitting. GRU is faster and more
efficient than LSTM, but it may not capture long-term dependencies as well
as LSTM. Some empirical studies have shown that LSTM and GRU perform
similarly on many natural language processing tasks, such as sentiment
analysis, machine translation, and text generation. However, some tasks may
benefit from the specific features of LSTM or GRU, such as image captioning,
speech recognition, or video analysis.
Bidirectional LSTM
Introduction:
To understand the working of Bi-LSTM, we first need to understand the LSTM unit cell and the LSTM network. LSTM stands for long short-term memory. In 1997, Hochreiter and Schmidhuber introduced LSTM networks, which are among the most commonly used recurrent neural networks.
Need of LSTM
As we know that sequential data is better handled by recurrent neural networks, but
sometimes it is also necessary to store the result of the previous data. For example,
“I will play cricket” and “I can play cricket” are two different sentences with different
meanings. As we can see, the meaning of the sentence depends on a single word so,
it is necessary to store the data of previous words. But no such memory is available
in simple RNN. To solve this problem, we need to study a term called LSTM.
Input gate
First, the current input x(t) and the previous hidden state h(t-1) are passed into the input gate, i.e., the second sigmoid function. The values are transformed to lie between 0 and 1, where 0 means not important and 1 means important. Furthermore, the current input and hidden state information are passed through the tanh function. The output from the tanh function ranges from -1 to 1, and it helps to regulate the network. The output values generated from the two activation functions are ready for point-by-point multiplication.
Forget gate
The forget gate decides which information needs to be kept for further processing
and which can be ignored. The hidden state h(t-1) and current input X(t) information
are passed through the sigmoid function. After passing the values through the
sigmoid function, it generates values between 0 and 1 that conclude whether the
part of the previous output is necessary (by giving the output closer to 1).
Output gate
The output gate helps in deciding the value of the next hidden state. This state
contains information on previous inputs. First, the current and previously hidden
state values are passed into the third sigmoid function. Then the new cell state
generated from the cell state is passed through the tanh function. Both these
outputs are multiplied point-by-point. Based upon the final value, the network decides
which information the hidden state should carry. This hidden state is used for
prediction.
Finally, the new cell state and the new hidden state are carried over to the next step.
To conclude, the forget gate determines which relevant information from the prior
steps is needed. The input gate decides what relevant information can be added
from the current step, and the output gates finalize the next hidden state.
How do LSTM works?
The Long Short-Term Memory architecture was motivated by an analysis of error flow in existing RNNs, which revealed that long time lags were inaccessible to existing designs because the backpropagated error either blows up or decays exponentially.
An LSTM layer is made up of memory blocks that are recurrently linked. These
blocks can be thought of as a differentiable version of a digital computer's memory
chips. Each one has recurrently connected memory cells as well as three
multiplicative units – the input, output, and forget gates – that offer continuous
analogs of the cells' write, read, and reset operations.
What is Bi-LSTM?
Bidirectional LSTM networks function by presenting each training sequence
forward and backward to two independent LSTM networks, both of which are
coupled to the same output layer. This means that the Bi-LSTM contains
comprehensive, sequential information about all points before and after each point
in a particular sequence.
In other words, rather than encoding the sequence in the forward direction
only, we encode it in the backward direction as well and concatenate the results from
both forward and backward LSTM at each time step. The encoded representation of
each word now understands the words before and after the specific word.
Below is the basic architecture of Bi-LSTM.
Working of Bi-LSTM
Let us understand the working of Bi-LSTM using an example. Consider the sentence
“I will swim today”. The below image represents the encoded representation of the
sentence in the Bi-LSTM network.
So in the forward pass of the LSTM, “I” will be passed into the network at time t = 0, “will” at t = 1, “swim” at t = 2, and “today” at t = 3. In the backward LSTM, “today” will be passed into the network at t = 0, “swim” at t = 1, “will” at t = 2, and “I” at t = 3. In this way, the results of both the forward and backward LSTM at each time step are calculated.
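A minimal Keras sketch of a bidirectional LSTM for a sequence-classification task (the vocabulary size, embedding size, and layer widths are illustrative assumptions):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # one forward and one backward LSTM, outputs concatenated
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])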
UNIT- IV
GENERATIVE ADVERSARIAL NETWORKS (GANS):
Generative models, Concept and principles of GANs,
Architecture of GANs (generator and discriminator
networks), Comparison between discriminative and
generative models, Generative Adversarial Networks
(GANs), Applications of GANs
Generative Adversarial Networks and their models:
Introduction:
A Generative Adversarial Network (GAN) generates a new set of data, based on training data, that looks like the training data. GANs have two main blocks (two neural networks) which compete with each other and are able to capture, copy, and analyze the variations in a dataset. The two models are usually called the Generator and the Discriminator. To understand the term GAN, let's break it into its three parts: Generative, Adversarial, and Network.
The generator network takes random input (typically noise) and generates samples,
such as images, text, or audio, that resemble the training data it was trained on. The
goal of the generator is to produce samples that are indistinguishable from real data.
The discriminator network, on the other hand, tries to distinguish between real and generated samples. It is trained with real samples from the training data and with fake samples produced by the generator.
The training process involves an adversarial game between the generator and the discriminator. The generator aims to produce samples that fool the discriminator, while the discriminator tries to improve its ability to distinguish between real and generated data. This adversarial training pushes both networks to improve over time. Ideally, the process converges to a point where the generator is capable of generating high-quality samples that are difficult for the discriminator to tell apart from real data.
GANs have been used for image synthesis, text generation, and even video generation, among other tasks, and have greatly advanced the field of generative modeling.
Machine learning algorithms and neural networks can easily be fooled into misclassifying things by adding some amount of noise to the data: after adding a small amount of noise, the chances of misclassifying the images increase. This raised the question: is it possible to build something with which a neural network can start generating new patterns that look like the training data? GANs were built to do exactly this, generating new, fake results that resemble the training data.
The two major components of a GAN are the Generator and the Discriminator. The role of the generator is like that of a thief: it generates fake samples based on the original samples and tries to fool the discriminator into taking the fake for the real. The discriminator, on the other hand, acts like the police: it tries to catch the fake samples created by the generator and classify them as fake or real. This competition between the two components goes on until a level of perfection is achieved, where the generator's fakes can no longer be told apart from the real data.
Discriminator: a neural network that predicts whether the data it receives is fake or real. It is trained on real data and provides feedback to the generator.
Generator: a neural network that generates fake data based on the original (real) data. It also has hidden layers, activation functions, and a loss function. Its aim is to generate fake images based on the discriminator's feedback and to fool the discriminator so that it cannot tell them apart from real images. When the discriminator is fooled by the generator, training stops and we can say the GAN model is trained.
Here the generative model captures the distribution of the data and is trained in such a manner that it generates new samples which try to maximize the probability of the discriminator making a mistake. The discriminator, on the other hand, is based on a model that estimates the probability that the sample it receives comes from the training data and not from the generator, and it tries to classify samples accurately. Hence the GAN is formulated as a minimax game in which the discriminator tries to maximize its reward V(D, G) while the generator tries to minimize it.
Now you might be wondering what the actual architecture of a GAN looks like, and how the two neural networks are built, trained, and used for prediction. Both components are neural networks, and the output of the generator network is fed directly as input to the discriminator network.
We know the geometric intuition of a GAN; now let us understand how a GAN is trained. In this section, the training of the Generator and the Discriminator will be made clear separately.
The problem statement is key to the success of the project, so the first step is to define your problem. GANs work with different kinds of problems, so you need to define what you are creating: audio, a poem, text, or an image.
There are many different types of GAN, which we will study further, so we also have to choose the architecture of the GAN to use.
The discriminator is then trained on the real dataset. The data you provide is without noise and contains only real images; the fake samples it must learn to reject come from the generator. The discriminator loss helps improve its performance and penalizes it when it misclassifies real data as fake or fake data as real.
Next, provide some fake input (noise) to the generator; it will use this random noise to generate fake outputs. When the generator is being trained, the discriminator is idle, and when the discriminator is being trained, the generator is idle. During generator training, the network takes random noise as input and tries to transform it into meaningful data; getting meaningful output from the generator takes time and runs over many epochs.
The generator's weights are then updated using the backpropagated gradients.
The samples which are generated by Generator will pass to Discriminator and It will
predict the data passed to it is Fake or real and provide feedback to Generator again.
Again Generator will be trained on the feedback given by Discriminator and try to
improve performance.
This is an iterative process that continues until the generator succeeds in fooling the discriminator.
I hope that the working of the GAN network is now completely clear. Let us understand the loss function that is minimized and maximized in this
iterative process. The generator tries to minimize the following value function while the discriminator tries to maximize it; it is the same idea as a minimax game, if you have ever played one.
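The value function being referred to is the standard one from the original GAN formulation (Goodfellow et al., 2014); it is written out here in LaTeX because the formula itself does not appear in the extracted text:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]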
D(x) is the discriminator’s estimate of the probability that real data instance x is real.
D(G(z)) is the discriminator’s estimate of the probability that a fake instance is real.
E_x is the expected value over all real data instances, and E_z is the expected value over all random inputs to the generator (in effect, the expected value over all generated fake instances G(z)).
7) CycleGAN

Applications of GANs
Generate New Examples for Datasets:
GANs can be used to generate new examples for image datasets in various domains,
such as medical imaging, satellite imagery, and natural language processing. By
generating synthetic data, researchers can augment existing datasets and improve
the performance of machine learning models.
Generate Realistic Photographs:
GANs can generate realistic photographs of various objects and scenes, including
landscapes, animals, and architecture. These rendered images can be used to
augment existing image datasets or to create entirely new datasets.
Generate Cartoon Characters:
GANs can be used to generate cartoon characters that are similar to those found in
popular movies or television shows. These developed characters can create new
content or customize existing characters in games and other applications.
Image-to-Image Translation:
GANs can translate images from one domain to another, such as converting a
photograph of a real-world scene into a line drawing or a painting. You can create
new content or transform existing images in various ways.
Text-to-Image Translation:
GANs can be used to generate images based on a given text description. You can
use it to create visual representations of concepts or generate images for machine
learning tasks.
Semantic-Image-to-Photo Translation:
GANs can translate images from a semantic representation (such as a label map or
a segmentation map) into a realistic photograph. You can use it to generate
synthetic data for training machine learning models or to visualize concepts more
practically.
Face Frontal View Generation:
GANs can generate frontal views of faces from images that show the face at an
angle. You can use it to improve face recognition algorithms' performance or
synthesize pictures for use in other applications.
Generate New Human Poses:
GANs can generate images of people in new poses, including poses that would be difficult or impossible to capture in a photograph. This can be used to create new content or to augment existing image datasets.
Photos to Emojis:
GANs can be used to convert photographs of people into emojis, creating a more
personalized and expressive form of communication.
Photograph Editing:
GANs can be used to edit photographs in various ways, such as changing the
background, adding or removing objects, or altering the appearance of people or
animals in the image.
Face Aging:
GANs can be used to generate images of people at different ages, allowing users to
visualize how they might look in the future or to see what they might have looked like
in the past.
Comparison between Discriminative and Generative Models
Discriminative and generative models differ in their core idea, their mathematical intuition, and their typical applications.
Since these models use different approaches to machine learning, each is suited to specific tasks: generative models are useful for unsupervised learning tasks, while discriminative models are useful for supervised learning tasks. GANs (Generative Adversarial Networks) can be thought of as a competition between the generator, which is a generative component, and the discriminator, which is a discriminative component; so a GAN is, in essence, a generative model pitted against a discriminative model.
Some comparisons between discriminative and generative models, based on criteria such as performance, handling of missing data, robustness to outliers, computational cost, and typical applications, are given below.

Based on Performance
Generative models need less data to train than discriminative models, since generative models are more biased: they make stronger assumptions about the data (for example, an assumption of conditional independence between features).
Based on Missing Data
In general, if we have missing data in our dataset, generative models can still work with it, while discriminative models cannot. This is because, in generative models, we can still estimate the posterior by marginalizing over the unseen (missing) variables.
Based on Applications
Discriminative models are called “discriminative” since they are useful for discriminating Y’s label, i.e., the target outcome, so they can only solve classification-type (supervised) tasks.
The inspiration comes from the human mind and the way we use past experiences to make informed decisions in the present and the future. While there are already many applications of ML and DL, the future possibilities are endless. Quintillions of bytes of data are generated all over the world almost every day, so getting fresh data is easy; but to work with this gigantic amount of data we need new algorithms, or we need to scale up existing ones.
Machine learning models can broadly be divided into two types:
1. Discriminative models
2. Generative models
Mathematically, generative classifiers assume a functional form for P(Y) and P(X|Y), estimate the parameters of these distributions from the data, and then use Bayes’ theorem to calculate the posterior probability P(Y|X). Discriminative classifiers, on the other hand, assume a functional form for P(Y|X) directly and estimate its parameters from the provided data.
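In symbols, this is simply Bayes’ theorem (stated here for reference):

P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}, \qquad P(X) = \sum_{y} P(X \mid Y = y)\, P(Y = y)

Generative classifiers estimate P(Y) and P(X|Y) and apply this rule; discriminative classifiers model P(Y|X) directly.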
Discriminative model
The majority of discriminative models, aka conditional models, are used for
supervised machine learning. They do what they ‘literally’ say, separating the data
points into different classes and learning the boundaries using probability estimates
and maximum likelihood.
Outliers have little to no effect on these models, and for many tasks they are a better choice than generative models; misclassification problems, however, can still be a major drawback.
Here are some examples of widely used discriminative models: logistic regression and support vector machines (SVMs) are classic examples, and a few others are commonly used neural nets, k-nearest neighbors (KNN), conditional random fields (CRF), random forests, etc.
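As a small, hedged illustration (not part of the original notes), a logistic regression classifier in scikit-learn models P(Y|X) directly from the data; the synthetic dataset and every parameter below are arbitrary choices.

# Discriminative model sketch: logistic regression learns P(Y|X) directly.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
print("P(Y|X) for one sample:", clf.predict_proba(X_test[:1]))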
Generative model
As the name suggests, generative models can be used to generate new data points.
These models are usually used in unsupervised machine learning problems.
Generative models go in-depth to model the actual data distribution and learn the
different data points, rather than model just the decision boundary between classes.
These models are sensitive to outliers, which is their main drawback compared with discriminative models. The mathematics behind generative models is quite intuitive too, although the method is not as direct as with discriminative models: to calculate P(Y|X), they first estimate the prior probability P(Y) and the likelihood P(X|Y) from the data provided.
Putting the values into Bayes’ theorem’s equation, we get an accurate value for
P(Y|X).
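For contrast with the discriminative sketch above, here is an equally illustrative generative classifier: Gaussian Naive Bayes estimates the prior P(Y) and the class-conditional likelihood P(X|Y) from the data and then applies Bayes’ theorem to obtain the posterior P(Y|X). The dataset is again synthetic.

# Generative model sketch: Gaussian Naive Bayes estimates P(Y) and P(X|Y),
# then uses Bayes' theorem to compute the posterior P(Y|X).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
gnb = GaussianNB().fit(X, y)

print("Estimated priors P(Y):", gnb.class_prior_)            # learned from class frequencies
print("Posterior P(Y|X) for one sample:", gnb.predict_proba(X[:1]))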
Here are some examples of widely used generative models:
1. Bayesian network: Also known as Bayes’ network, this model uses a directed
acyclic graph (DAG) to draw Bayesian inferences over a set of random variables to
calculate probabilities. It has many applications like prediction, anomaly detection,
time series prediction, etc.
2. Autoregressive model: Mainly used for time series modeling, it finds a correlation
between past behaviors to predict future behaviors.
3. Generative adversarial network (GAN): It’s based on deep learning technology and uses two sub-models: the generator model generates new data points, and the discriminator model classifies these ‘generated’ data points as real or fake.
Some other examples include Naive Bayes, Markov random field, hidden Markov
model (HMM), latent Dirichlet allocation (LDA), etc.
Discriminative models divide the data space into classes by learning the boundaries,
whereas generative models understand how the data is embedded into the space.
Both the approaches are widely different, which makes them suited for specific
tasks.
Deep learning has mostly been using supervised machine learning algorithms like
artificial neural networks (ANNs), convolutional neural networks (CNNs), and
recurrent neural networks (RNNs). ANN is the earliest in the trio and leverages
artificial neurons, backpropagation, weights, and biases to identify patterns based on
the inputs. CNN is mostly used for image recognition and computer vision tasks; it works by extracting important features from an input image using convolution and pooling layers. RNN, which is the latest of
the three, is used in advanced fields like natural language processing, handwriting
recognition, time series analysis, etc.
These are the fields where discriminative models are effective and better used for
deep learning as they work well for supervised tasks.
Apart from these, deep learning and neural nets can be used to cluster images based
on similarities. Algorithms like autoencoder, Boltzmann machine, and self-organizing
maps are popular unsupervised deep learning algorithms. They make use of
generative models for tasks like exploratory data analysis (EDA) of high dimensional
datasets, image denoising, image compression, anomaly detection and even
generating new images.
This Person Does Not Exist - Random Face Generator is an interesting website that
uses a type of generative model called StyleGAN to create realistic human faces,
even though the people in these images don’t exist!
Auto-encoders:
Autoencoders are a type of deep learning algorithm that are designed to receive an
input and transform it into a different representation. They play an important part in
image reconstruction.
Artificial Intelligence encompasses a wide range of technologies and techniques that enable computer systems to solve problems such as data compression, which is used in computer vision, computer networks, computer architecture, and many other fields. Autoencoders are unsupervised neural networks that use machine learning to do this compression for us.
We have a similar machine learning algorithm, i.e., PCA (principal component analysis), which does the same task, although PCA is restricted to linear transformations while an autoencoder can also learn non-linear mappings.
Applications of Autoencoders
Image Coloring
Autoencoders are used for converting any black and white picture into a colored image. Depending on
what is in the picture, it is possible to tell what the color should be.
Feature variation
It extracts only the required features of an image and generates the output by removing any noise or
unnecessary interruption.
Dimensionality Reduction
The encoded representation (the code) has far fewer dimensions than the input, while the reconstructed image stays close to the original. This helps in providing a similar image from a much smaller number of stored values.
Denoising Image
The input seen by the autoencoder is not the raw input but a stochastically corrupted version. A
denoising autoencoder is thus trained to reconstruct the original input from the noisy version.
Watermark Removal
It is also used for removing watermarks from images or to remove any object while filming a video or
a movie.
Architecture of Autoencoders
An Autoencoder consists of three layers:
1. Encoder
2. Code
3. Decoder
Encoder: This part of the network compresses the input into a latent space representation.
The encoder layer encodes the input image as a compressed representation in a reduced
dimension. The compressed image is the distorted version of the original image.
Code: This part of the network represents the compressed input which is fed to the decoder.
Decoder: This layer decodes the encoded image back to the original dimension. The decoded
image is a lossy reconstruction of the original image and it is reconstructed from the latent
space representation.
The layer between the encoder and decoder, i.e., the code, is also known as the Bottleneck. This is a well-designed approach to decide which aspects of the observed data are relevant information and which aspects can be discarded. It does this by balancing the compactness of the compressed representation against how much relevant information from the input is retained.
An autoencoder consists of two parts: an encoder network and a decoder network. The encoder
network compresses the input data, while the decoder network reconstructs the compressed data
back into its original form. The compressed data, also known as the bottleneck layer, is typically much
smaller than the input data.
The encoder network takes the input data and maps it to a lower-dimensional representation. This
lower-dimensional representation is the compressed data. The decoder network takes this
compressed data and maps it back to the original input data. The decoder network is essentially the mirror image of the encoder.
The bottleneck layer is the layer in the middle of the autoencoder that contains the compressed data.
This layer is much smaller than the input data, which is what allows for compression. The size of the
bottleneck layer determines the amount of compression that can be achieved.
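A minimal sketch of this encoder–bottleneck–decoder structure in Keras. The flattened 784-pixel input (e.g., MNIST) and the 32-unit bottleneck are assumptions made for illustration, not values taken from these notes.

# Minimal dense autoencoder sketch (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(784,))                     # flattened 28x28 image
code = layers.Dense(32, activation="relu")(inputs)        # encoder -> bottleneck ("code")
outputs = layers.Dense(784, activation="sigmoid")(code)   # decoder -> reconstruction

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")          # minimize reconstruction error

# Training uses the inputs as their own targets (no labels required); x_train is assumed
# to be an array of flattened images scaled to [0, 1]:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)

A larger bottleneck keeps more information (less compression); a smaller one compresses more aggressively at the cost of reconstruction quality.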
Autoencoders differ from other deep learning architectures, such as convolutional neural networks
(CNNs) and recurrent neural networks (RNNs), in that they do not require labeled data. Autoencoders
can learn the underlying structure of the data without any explicit labels.
There are two types of image compression: lossless and lossy. Lossless compression methods
preserve all of the data in the original image, while lossy compression methods discard some of the
data to achieve higher compression rates.
Autoencoders are mainly used for lossy compression. Near-lossless behaviour can in principle be approached by using a bottleneck layer close to the size of the input data; in that case the autoencoder essentially learns to encode and decode the input with very little loss of information, although it then provides little real compression.
Lossy compression can be achieved by using a bottleneck layer that is smaller than the input data. In
this case, the autoencoder learns to discard some of the data to achieve higher compression rates.
The amount of data that is discarded depends on the size of the bottleneck layer.
Autoencoders can thus be used for image compression and reconstruction: an image is compressed into a smaller representation and then reconstructed back to an approximation of its original form. Image reconstruction is the process of creating an image from the compressed data.
The compressed data can be thought of as a compressed version of the original image. To
reconstruct the image, the compressed data is fed through a decoder network, which expands the
data back to its original size. The reconstructed image will not be identical to the original, but it will be
a close approximation.
Autoencoders use a loss function to determine how well the reconstructed image matches the original. The loss function measures the difference between the reconstructed image and the original image, and the goal of training is to minimize this loss so that the reconstructed image is as close to the original as possible.
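A common choice for this loss is the mean squared error between the input x and its reconstruction x̂ (binary cross-entropy is another popular option when pixel values lie in [0, 1]):

L(x, \hat{x}) = \lVert x - \hat{x} \rVert^{2} = \sum_{i} (x_i - \hat{x}_i)^{2}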
An example of image reconstruction using autoencoders is the MNIST dataset, which consists of
handwritten digits. The autoencoder is trained on the dataset to compress and reconstruct the
images. Another example is the CIFAR-10 dataset, which consists of 32×32 color images of objects.
The autoencoder can be trained on this dataset to compress and reconstruct the images.
Autoencoders can be modified and improved for better image compression and reconstruction. Some
of the variations of autoencoders are:
Denoising autoencoders:
Denoising autoencoders are used to remove noise from images. The autoencoder receives noisy images as input and is trained to reconstruct the original, clean image from the noisy version.
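Reusing the autoencoder sketched earlier, denoising training simply corrupts the inputs and keeps the clean images as targets; x_train and the 0.3 noise level below are assumptions made for illustration.

# Denoising autoencoder training sketch: noisy inputs, clean targets.
import numpy as np

# x_train is assumed to be an array of clean, flattened images scaled to [0, 1],
# and `autoencoder` is the model defined in the earlier sketch.
noise = 0.3 * np.random.normal(size=x_train.shape)
x_noisy = np.clip(x_train + noise, 0.0, 1.0)

# The model learns to map the corrupted input back to the original image.
autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=256)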
Variational autoencoders:
Variational autoencoders (VAEs) are a type of autoencoder that learn the probability distribution of
the input data. VAEs are trained to generate new samples from the learned distribution. This makes
VAEs suitable for image generation tasks.
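Concretely, the encoder of a VAE outputs a mean μ and a standard deviation σ for each latent dimension, and a latent sample is drawn with the reparameterization trick (standard VAE formulation, stated here for completeness):

z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

The training objective adds a KL-divergence term to the reconstruction loss so that the learned latent distribution stays close to a standard normal prior.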
Convolutional autoencoders:
Convolutional autoencoders (CAEs) use convolutional neural networks (CNNs) for image
compression and reconstruction. CNNs are specialized neural networks that can learn features from
images.
The effectiveness of different types of autoencoders for image compression and reconstruction can
be compared using metrics such as PSNR and SSIM. CAEs are generally more effective for image
compression and reconstruction than other types of autoencoders. VAEs are better suited for image
generation tasks.
Real-Time Examples:
A real-time example of an autoencoder for image compression and reconstruction is Google’s Guetzli
algorithm. Guetzli uses a combination of a perceptual metric and a psycho-visual model to compress
images while maintaining their quality. Another example is the Deep Image Prior algorithm, which
uses a convolutional neural network to reconstruct images from compressed data.
Autoencoders have become increasingly popular for image compression and reconstruction tasks
due to their ability to learn efficient representations of the input data. In this section, we will explore some of
the common applications of autoencoders for image compression and reconstruction.
Medical Imaging:
Autoencoders have shown great promise in medical imaging applications such as Magnetic
Resonance Imaging (MRI), Computed Tomography (CT), and X-Ray imaging. The ability of
autoencoders to learn feature representations from high-dimensional data has made them useful for
compressing medical images while preserving diagnostic information. For example, researchers have
developed a deep learning-based autoencoder approach for compressing 3D MRI images, which
achieved higher compression ratios than traditional compression methods while preserving
diagnostic quality. This can have significant implications for improving the storage and transmission
of medical images, especially in resource-limited settings.
Video Compression:
Autoencoders have also been used for video compression, where the goal is to compress a sequence
of images into a compact representation that can be transmitted or stored efficiently. Learned video compression schemes combine autoencoder-style networks with traditional compression techniques to achieve higher compression rates while maintaining video quality. The autoencoder component of such a codec is used to learn spatial and temporal features of the video frames, which are
then used to reduce redundancy in the video data.
Autonomous Vehicles:
Autoencoders are also useful for autonomous vehicle applications, where the goal is to compress
high-resolution camera images captured by the vehicle’s sensors while preserving critical information
for navigation and obstacle detection. For example, researchers have developed an autoencoder-
based approach for compressing images captured by a self-driving car, which achieved high
compression ratios while preserving the accuracy of object detection algorithms. This can have
significant implications for improving the performance and reliability of autonomous vehicles,
especially in scenarios where high-bandwidth communication is not available.
Social Media and Web Applications:
Autoencoders have also been used in social media and web applications, where the goal is to reduce
the size of image files to improve website loading times and reduce bandwidth usage. For example,
Facebook uses an autoencoder-based approach for compressing images uploaded to their platform,
which achieves high compression ratios while preserving image quality. This has led to faster loading
times for images on the platform and reduced data usage for users.
Relationship between Autoencoders and GANs
Autoencoders and GANs are both powerful techniques for learning from data
in an unsupervised way, but they have some differences and trade-offs.
Autoencoders are easier to train and more stable, but they tend to produce blurry or
distorted reconstructions or generations. GANs are harder to train and more prone to
mode collapse, where they produce only a few modes of the data distribution, but
they tend to produce sharper and more diverse generations. Depending on your goal
and your data, you might prefer one or the other, or even combine them in a hybrid
model.
Autoencoders are unsupervised models: they are not trained on labeled data; instead, they are trained on unlabeled data and learn to reconstruct the input. GANs likewise do not require explicit class labels, although their training is often described as self-supervised or adversarial: the generator is trained to produce data that looks like the training data, and the discriminator is trained to distinguish between real and generated (fake) data. Autoencoders are typically used for tasks such as image denoising and compression, while GANs are typically used for tasks such as image generation and translation.
How can you combine GANs and autoencoders to create hybrid models for various tasks?
Generative adversarial networks (GANs) and autoencoders are two powerful types of
artificial neural networks that can learn from data and generate new samples. But
what if you could combine them to create hybrid models that can perform various
tasks, such as image synthesis, anomaly detection, or domain adaptation? In this section, you will learn how GANs and autoencoders work, and how you can combine them to create hybrid models for various tasks.
GANs are composed of two networks: a generator and a discriminator. The generator tries
to create realistic samples from random noise, while the discriminator tries to distinguish
between real and fake samples. The two networks compete with each other, improving their
skills over time. Autoencoders are composed of two networks: an encoder and a decoder.
The encoder compresses the input data into a lower-dimensional representation, while the
decoder reconstructs the input data from the representation. The goal is to minimize the
reconstruction error, while learning useful features from the data.
Hybrid models
Hybrid models are models that combine GANs and autoencoders in different ways,
depending on the task and the objective. For example, you can use an autoencoder
as the generator of a GAN, and train it to fool the discriminator, while also minimizing
the reconstruction error. This way, you can generate realistic samples that are
similar to the input data, but also have some variations. Alternatively, you can use a
GAN as the encoder of an autoencoder, and train it to encode the input data into a
latent space that is compatible with the discriminator. This way, you can learn a
meaningful representation of the data that can be used for downstream tasks, such
as classification or clustering.
Image synthesis
One of the most common tasks for hybrid models is image synthesis, which is the
process of creating new images from existing ones, or from scratch. For example,
you can use a hybrid model to synthesize images of faces, animals, or landscapes,
by using an autoencoder as the generator of a GAN, and feeding it with real images
or random noise. This way, you can create diverse and realistic images that preserve
the attributes of the input data, but also have some variations. You can also use a
hybrid model to synthesize images of different domains, such as converting photos
to paintings, or day to night, by using a GAN as the encoder of an autoencoder, and
feeding it with images from both domains. This way, you can learn a common latent
space that can be used to transfer the style or the attributes of one domain to
another.
Anomaly detection
Another task for hybrid models is anomaly detection, which is the process of
identifying abnormal or unusual patterns in the data, such as outliers, frauds, or
defects. For example, you can use a hybrid model to detect anomalies in images,
such as damaged products, or medical conditions, by using an autoencoder as the
generator of a GAN, and feeding it with normal images. This way, you can train the
autoencoder to reconstruct normal images well, but fail to reconstruct abnormal
images. Then, you can use the reconstruction error or the discriminator score as a
measure of anomaly. You can also use a hybrid model to detect anomalies in time
series, such as sensor readings, or financial transactions, by using a GAN as the
encoder of an autoencoder, and feeding it with normal time series. This way, you can
train the GAN to encode normal time series well, but fail to encode abnormal time
series. Then, you can use the latent space or the discriminator score as a measure of
anomaly.
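A hedged sketch of the reconstruction-error criterion described above, assuming an autoencoder (for example, the Keras one sketched earlier) already trained on normal, flattened samples only; x_new and the threshold are illustrative assumptions.

# Anomaly scoring sketch: a large reconstruction error suggests an anomaly.
import numpy as np

# `autoencoder` is assumed to be trained on normal samples only;
# `x_new` is a batch of flattened samples to score.
reconstructions = autoencoder.predict(x_new)
errors = np.mean((x_new - reconstructions) ** 2, axis=1)   # per-sample MSE

threshold = 0.05   # illustrative; in practice chosen from validation data
is_anomaly = errors > threshold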
Domain adaptation
A third task for hybrid models is domain adaptation, which is the process of adapting
a model trained on one domain to work on another domain, without requiring labeled
data from the target domain. For example, you can use a hybrid model to adapt a
model trained on images of handwritten digits to work on images of handwritten
letters, by using a GAN as the encoder of an autoencoder, and feeding it with images
from both domains. This way, you can train the GAN to encode both domains into a
shared latent space that is invariant to the domain differences. Then, you can use
the latent space as the input for a classifier or a decoder that can work on both
domains.