DL 1 - ComputerVision With PyTorch Notes

Deep Learning with PyTorch:

INTRODUCTION:

The core PyTorch modules for building neural networks are located in torch.nn,
which provides common neural network layers and other architectural
components. Fully connected layers, convolutional layers, activation functions,
and loss functions can all be found here.

PyTorch defaults to an immediate execution model (eager mode). Whenever an
instruction involving PyTorch is executed by the Python interpreter, the
corresponding operation is immediately carried out by the underlying C++ or
CUDA implementation.

PyTorch also provides a way to compile models ahead of time through
TorchScript.
Using TorchScript, PyTorch can serialize a model into a set of instructions that
can be invoked independently from Python: say, from C++ programs or on
mobile devices. We can think about it as a virtual machine with a limited
instruction set, specific to tensor operations. This allows us to export our model,
either as TorchScript to be used with the PyTorch runtime, or in a standardized
format called ONNX.
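As an illustration, here is a minimal sketch of scripting and serializing a model (the model and file name are placeholders, not from the original notes):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
scripted = torch.jit.script(model)    # compile the model to TorchScript.
scripted.save("model_scripted.pt")    # serialize; loadable without the Python source.
restored = torch.jit.load("model_scripted.pt")  # runs under the PyTorch runtime.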

INSTALLATION:

PYTORCH on CPU:

(1) Install Anaconda and create a new environment in it. (If not using conda - as it
itself takes some space - create a plain Python virtual environment with "python3
-m venv <env name>", then activate it with "source <env name>/bin/activate" (use
"deactivate" to close the virtual env). You might need to install pip too: "python3 -m pip
install --user --upgrade pip". VSCode can then be started from this terminal as
"code ." - provided the one-time "Shell Command: Install 'code' command in PATH" action has been run in VSCode.)
(2) Open terminal in conda by clicking on the “Play” button next to the
environment name. Install Pytorch using command:
conda install pytorch torchvision -c pytorch
(link: https://pytorch.org/get-started/locally/)
(3) Install/launch Jupyter Notebook in Anaconda, then type (press Shift+Enter to
execute each command):
import torch
torch.__version__ # check torch version
torch.cuda.is_available() # check if GPU processing is possible.

Installing for GPU via CUDA:

Installing PyTorch inside an Anaconda environment is preferable, as PyTorch
automatically pulls & installs its compatible CUDA inside the conda environment,
leaving the already installed CUDA at the system level intact.
Note that CUDA should already be installed at the system level before
installing PyTorch inside conda, in order for PyTorch to pull its compatible CUDA
into the conda environment.
See link for the exact command: https://pytorch.org/get-started/locally/
(select "conda" in the "Package" options, if using conda)

- VSCode has a "Data Viewer" to view the contents of a 2D slice of a
tensor. When debugging with breakpoints, right-click on the variable in the
"Variables" window and select "View Value in Data Viewer".

- cuDNN (the NVIDIA CUDA Deep Neural Network library) is a CUDA-optimized
library of GPU implementations for deep learning. Think of cuDNN as a library
for deep learning built on CUDA, and CUDA as the way to talk to the GPU.
cuDNN provides highly tuned implementations of standard routines such as
forward and backward convolution, attention, matmul, pooling, and
normalization.

- Nvidia commands:

(1) nvidia-smi: This utility/command (run in a command prompt) allows
administrators to query GPU device state and, with the appropriate privileges,
to modify GPU device state. It also gives info about the GPU hardware, its
drivers, the CUDA version, etc. (try the -h/--help flag for more info).
Use nvidia-smi -l to get/track an (almost) realtime update on memory usage.

(2) nvtop: For GPU process monitoring, an "htop"-like task monitor for
AMD, Intel and NVIDIA GPUs. It can handle multiple GPUs and prints information
about them in an htop-familiar way.

TENSORBOARD:

TensorBoard is a suite of web applications for inspecting and understanding your
model runs and graphs.
(link: https://towardsdatascience.com/a-complete-guide-to-using-tensorboard-with-pytorch-53cb2301e8c3)

(1) TensorBoard can be installed alongside PyTorch using either:
- pip (pip install tensorboard)
- Anaconda (conda install -c conda-forge tensorboard)

(2) Import required libraries:
Ex:
from torch.utils.tensorboard import SummaryWriter
OR
from torch.utils.tensorboard.writer import SummaryWriter

# We create an instance of SummaryWriter and add our model's evaluation metrics - the loss, the number of correct predictions, accuracy, etc. - to it. A convenient feature of TensorBoard is that we simply feed our output tensors to it and it displays the plots of all those metrics; TensorBoard takes care of all the plotting for us.

tb = SummaryWriter()
# tb.add_graph(model, images)  # displays the model architecture graph.

for epoch in range(10):
    total_loss = 0
    total_correct = 0

    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images)
        loss = criterion(preds, labels)

        total_loss += loss.item()                        # during training.
        total_correct += get_num_correct(preds, labels)  # during validation.

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # get LR from optim: optimizer.param_groups[0]['lr'].
        # to get the current LR from a scheduler: scheduler.get_last_lr()[0].

    tb.add_scalar("Loss", total_loss, epoch)  # plot "total loss" at "epoch".
    tb.add_scalar("Correct", total_correct, epoch)
    tb.add_scalar("Accuracy", total_correct / len(train_set), epoch)
    # tb.add_histogram("conv1.bias", model.conv1.bias, epoch)

    print("epoch:", epoch, "total_correct:", total_correct, "loss:", total_loss)

# optional - hyperparameter tuning visualization.
tb.add_hparams(
    {"lr": lr, "bsize": batch_size, "shuffle": shuffle},
    {
        "accuracy": total_correct / len(train_set),
        "loss": total_loss,
    },
)

tb.close()

Note that every tb.add_scalar() call takes three arguments: a string that becomes
the heading of the line chart/histogram, the tensor (or scalar) containing the values
to be plotted, and finally a global step. Since we are doing an epoch-wise analysis,
we set it to epoch.
After running the code, a "runs" folder is created (by default; this can be
changed when creating SummaryWriter()) in the project directory. All subsequent
runs are stored in that folder, organized by date. This way you have an efficient
log of all runs, which can be viewed and compared in TensorBoard.

(3) Now use the command line (or Anaconda Prompt) to navigate to the
project directory where the "runs" folder is present and run the following
command:
tensorboard --logdir runs
It will then serve TensorBoard on localhost; the link will be displayed in the
terminal.

Running the command above will display the line graphs for the loss, the
number of correct predictions, and the accuracy.

Hyperparameter tuning visualization:
tb.add_hparams() allows us to log hyperparameters to keep track of the
training configuration. It takes two dictionaries as inputs: one with the
hyperparameters and another with the evaluation metrics to be analyzed.

This graph combines the logs of all runs, so you can take the highest
accuracy and lowest loss value and trace them back to the corresponding batch
size, learning rate and shuffle configurations.
From the hyperparameter graph it is very clear that setting shuffle to
False (0) tends to yield very poor results. Hence keeping shuffle True (1) is
ideal for training, as it adds randomization.

TERMINOLOGIES:

- Torch Hub: a mechanism through which authors can publish a model on GitHub,
with or without pretrained weights, and expose it through an interface that
PyTorch understands. This makes loading a pretrained model from a third party
as easy as loading a TorchVision model.
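For example, a minimal sketch following the official Torch Hub docs (the exact pretrained-weights argument depends on the torchvision version):

import torch

# load a pretrained ResNet-18 from the pytorch/vision repo via Torch Hub.
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
model.eval()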

- Batch Size, Iterations & Epochs:
A single iteration (or step) refers to a single update of the model's weights
based on one batch of training data.
Batch size determines the size of the batch during a single weight update.
An Epoch is a complete pass through the entire training dataset.

Ex:
for epoch in range(<number of epochs>):
    # iterate over multiple batches/iterations (say I), till the entire dataset is covered.
    for (inputs, labels) in <dataloader>:
        # perform training with model on inputs (of size = batch size)
        # loss calculation.
        # optimizer step i.e. weights update.  <- one iteration.
    print(epoch)  # one epoch ended.

So, to relate the number of iterations, batch size & epochs:

Number of iterations per epoch (I) = size of dataset (training samples) / batch size.
Samples seen in 1 epoch = number of iterations per epoch * batch size = size of the dataset.
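Ex: with 50,000 training samples and a batch size of 100, one epoch = 50,000 / 100 = 500 iterations (weight updates), so 10 epochs perform 5,000 updates in total.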

tqdm: tqdm is a handy tool for showing a progress bar during training, indicating
how much progress has been made. It works on all iterables.
Use {pip install tqdm} or {conda install -c conda-forge tqdm} (on conda) to install
tqdm.
Example code:
from tqdm import tqdm

for epoch in range(<number of epochs>):
    for (inputs, labels) in tqdm(<dataloader>):  # with enumerate(), unpack as "i, (inputs, labels)".
        <training code>

Note that using print() inside a tqdm loop does not work well (it breaks the
progress bar display). Use either print() or tqdm to display information. If not
printing info over epochs via print(), tqdm can also be used on the outer loop as:
for epoch in tqdm(range(<number of epochs>)):

- trange() can also be used in place of tqdm(range(...)).
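A minimal runnable sketch (time.sleep stands in for a real training step):

from tqdm import tqdm
import time

for epoch in tqdm(range(3), desc="epochs"):     # outer progress bar over epochs.
    for step in tqdm(range(100), leave=False):  # inner bar, cleared when done.
        time.sleep(0.01)                        # stand-in for a training step.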

- Some common Deep NN Architectures:
=> AlexNet (8 layers deep, 1000 categories, input 227x227x3)

(figure: AlexNet architecture.)

=> ResNet (Residual Network)

=> MobileNet (a family of mobile-first computer vision models, designed to
effectively maximize accuracy while being mindful of the restricted resources of
an on-device or embedded application. MobileNets are small, low-latency, low-power
models parameterized to meet the resource constraints of a variety of use-cases.
They can be built upon for classification, detection, embeddings and segmentation.
It uses depthwise separable convolutions to significantly reduce the
number of parameters compared to networks with regular convolutions and the
same depth. This results in lightweight deep neural networks.)

=> EfficientNet (a CNN architecture and scaling method that uniformly scales all
dimensions of depth/width/resolution using a compound coefficient. Unlike
conventional practice, which scales these factors arbitrarily, the EfficientNet scaling
method uniformly scales network width (channels), depth (layers), and
resolution (input image size) with a set of fixed scaling coefficients.
The compound scaling method is justified by the intuition that if the input
image is bigger, then the network needs more layers to increase the receptive
field and more channels to capture more fine-grained patterns in the bigger
image.
EfficientNets also transfer well.)

=> GAN (Generative Adversarial Network - generator-discriminator, 2 networks)

=> CycleGAN (converts back & forth between 2 classes of images - 2
generators, 2 discriminators; the generators try to outsmart both discriminators)

=> NeuralTalk2 (input image (Convolutional NN), output text description of the
image (Recurrent NN))

- Neuron: At its core, it is nothing but a linear transformation of the
input (for example, multiplying the input by a number [the weight] and adding a
constant [the bias]), followed by the application of a fixed nonlinear function
(referred to as the activation function).

- Activation function: An activation function in deep learning is a
mathematical function (f(x) on input x) applied to the output of each neuron in a
neural network. It introduces non-linearities into the network, allowing it to
learn and model complex patterns in data.
An activation function has 2 uses:
- In the inner parts (layers) of the model, it allows the output
function to have different slopes at different values - something a linear function
by definition cannot do. By cleverly composing these differently sloped parts for
many outputs, neural networks can approximate arbitrary functions.
Activation functions are nonlinear: repeated applications of (w*x + b) without an
activation function result in a function of the same (affine linear) form, while the
nonlinearity allows the overall network to approximate more complex functions.
They are differentiable, so that gradients can be computed through them.
- At the last layer of the network, it has the role of concentrating
the outputs of the preceding linear operation into a given range.

- Inductive Bias: In machine learning, inductive bias refers to the
assumptions or preconceptions that a model or algorithm makes about the
underlying distribution of the data. These biases influence the model's ability to
learn from a given dataset and affect its performance on new, unseen data.
A model with too strong an inductive bias may fail to capture the
complexity of the underlying data, while a model with too weak an inductive
bias may overfit the training data.
There are several ways to describe the inductive bias of a model,
including:
- The choice of model architecture.
- The selection of features.
- The type of regularization applied to the model.
For example, a linear regression model has an inductive bias
towards linear relationships between variables, while a decision tree has an
inductive bias towards creating simple, hierarchical partitions of the data.

The inductive bias of a model is a trade-off between its ability to fit the
training data and its ability to generalize to new examples.

- Logarithm: The logarithm of a number n to a given base b is the exponent
to which the base must be raised to obtain the number n. In mathematical
notation, it is expressed as: log_b(n) = x.
This can be read as "the logarithm of n to the base b is equal to x." In other
words, b^x = n.
Ex: log_2(8) = 3 (2^3 = 2*2*2 = 8)
The logarithmic behavior is characteristic of algorithms that divide
problems into smaller subproblems in a systematic way, such as in binary search
or certain divide-and-conquer algorithms.

- Big O notation: It describes the upper bound of the growth rate of an
algorithm's time complexity or space complexity in terms of the input size 'n'.
Computational problems are classified into different complexity classes based on
the minimum time complexity required to solve them:

1) Polynomial (P): P is the set of problems that can be solved in
polynomial time using deterministic algorithms.

2) Non-Deterministic Polynomial (NP): Problems in NP are those for
which a proposed solution can be verified in polynomial time. While the
verification is polynomial, finding the solution itself may not be. In other words,
if you give me a solution, I can quickly check whether it's correct, but finding the
solution might be computationally hard.

3) NP-Hard: Problems that are "at least as hard as the hardest
problems in NP".
4) NP-Complete: Problems that are both NP-hard and in NP. For these
problems the correctness of each solution can be verified quickly, and a brute-force
search algorithm can actually find a solution by trying all possible solutions.

Note on the use of log(n) in Big O: Binary search is a classic
example of a divide-and-conquer algorithm. In binary search, we repeatedly
divide a sorted array in half until we find the target element or determine that it's
not present.
- After one division, the problem size is reduced to half.
- After two divisions, the problem size is reduced to 1/4 of the original.
- After three divisions, the problem size is reduced to 1/8.
- After k divisions, the problem size is reduced to 1/2^k of the original.
So, if the array size is 'n', after 'k' divisions the size becomes n/2^k. For binary
search, we want to find the smallest k such that n/2^k is less than or equal to 1.
Solving this inequality gives us {k ≥ log2(n)}.
Therefore, the time complexity of binary search is O(log(n)), reflecting the
logarithmic reduction of the problem size at each step of the algorithm. This is a
common characteristic of many divide-and-conquer algorithms, where the problem
is recursively divided into smaller subproblems, leading to logarithmic time
complexity.
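A minimal sketch of binary search illustrating the O(log n) halving described above:

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2        # each step halves the remaining range.
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1                       # not present.

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # -> 3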

- Covariance & Correlation Coefficient:

Variance "σ²" of a variable with observations {x1, x2, ..., xn} and mean "μ" is
calculated as:
V = σ² = [ (x1 - μ)² + (x2 - μ)² + ... + (xn - μ)² ] / N
i.e. σ² = ( Σ_{i=1..N} (xi - μ)² ) / N

Standard deviation (σ) or std is the square root of the variance:
std(σ) = √V = √(σ²)

Variance is less intuitive to interpret because it is in squared units, so its
scale differs from the original data. It is also sensitive to outliers.
Standard deviation is more interpretable as it is in the same units as the
data. It tells you the typical "distance" of data points from the mean. It is less
sensitive to outliers.
Variance is often used in statistical calculations but might not provide as
intuitive a sense of the data's spread.
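A quick NumPy check of these definitions (note the ddof argument, which switches between dividing by N and by N-1):

import numpy as np

x = np.array([3, 4, 5, 6, 102])
print(np.var(x))           # population variance (divides by N).
print(np.var(x, ddof=1))   # sample variance (divides by N-1).
print(np.std(x))           # population standard deviation (same units as x).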

Mean & Median:

The mean is a simple average: the sum of the observations divided by the
number of observations. (The mean of 3, 4, 5, 6, and 102 is 24.)
The median is the midpoint of the distribution (arranged in ascending order); half
of the observations lie above the median and half lie below. (The median of 3, 4,
5, 6, and 102 is 5.)
The mean is sensitive to outliers, while the median is not. The defining
characteristic of the median is that it does not weight observations by how far
they lie from the midpoint, only by whether they lie above or below it.

Covariance is also a measure of association: it measures the strength and
direction of the relationship between two random variables.
Formula (sample covariance):

cov(x, y) = ( Σ_{i=1..n} (xi - x̄)(yi - ȳ) ) / (n - 1)

Covariance uses a division by (n-1) when calculated on a sample (instead of the
entire population), to correct for the bias introduced when using sample means
to estimate population means.

This is because dividing by n would give a biased estimate of the
population variance. Dividing by (n-1) corrects this bias, accounting for the fact
that you are using the sample mean (x̄) to estimate the population mean (μ).
For population covariance, divide by 'n', not (n-1).

The covariance value ranges from -∞ to +∞.

Code:

import numpy as np

def compute_covariance(x, y):
    length = len(x)  # number of samples.
    mean_x = np.mean(x); mean_y = np.mean(y)
    x_mean_diff = x - mean_x; y_mean_diff = y - mean_y
    # compute the sample covariance of x & y together.
    covar = np.sum(x_mean_diff * y_mean_diff) / (length - 1)
    return covar

Correlation is also a measure of association, used for bivariate analysis. It
measures how well two variables are related.

Types of correlation:
1) Positive Correlation: When the value of one variable increases, the
other variable also increases.
2) Negative Correlation: When the value of one variable increases, the
other variable decreases.
3) No Correlation: When there is no linear relationship between two
variables.

The measure of correlation is known as the correlation coefficient. The range of
the correlation coefficient is -1 to +1. A scatterplot is used to visualize the
correlation between two numerical variables.

Code:

def compute_correlation(covar, x, y):
    # ddof=1 gives the sample std, matching the (n-1) used in the sample covariance.
    correl = covar / (np.std(x, ddof=1) * np.std(y, ddof=1))
    return correl
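A quick sanity check of the two functions above against NumPy's built-in (the data here is made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

covar = compute_covariance(x, y)
print(compute_correlation(covar, x, y))  # should (nearly) match:
print(np.corrcoef(x, y)[0, 1])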

Why correlation is preferred over covariance:

Covariance is a very good measure of association, so why do we need
correlation?
Covariance depends on the units of x and y. If we change the scale of x and y,
the covariance also changes:
Covariance units = (units of x) * (units of y)
If x and y are in cm, the covariance is in cm². If we change the scale to km, the
covariance is in km², and its value also differs. So covariance is scale-dependent,
which makes it difficult to interpret.

To normalize this and to get rid of units, we use the correlation coefficient.
Correlation Coefficient = Cov(x,y) / (std(x) * std(y))

The correlation coefficient is calculated by dividing the covariance of x, y by the
product of the standard deviations of x and y.
Units of Cov(x,y) = (unit of x) * (unit of y)
Units of the standard deviation of x = unit of x.
Units of the standard deviation of y = unit of y.

Intuition:
If the distance (std) from the mean for one variable tends to be broadly
consistent with distance (std) from the mean for the other variable (e.g. people
who are far from the mean for height in either direction tend also to be far from
the mean in the same direction for weight), then we would expect a strong
positive correlation.
If distance from the mean for one variable tends to correspond to a similar
distance from the mean for the second variable in the other direction (e.g. people
who are far above the mean in terms of exercise tend to be far below the mean
in terms of weight), then we would expect a strong negative correlation.
If two variables do not tend to deviate from the mean in any meaningful
pattern (e.g., patterns of shoe size and exercise) then we would expect little or
no correlation.

So, the units of the correlation coefficient = [(unit of x) * (unit of y)] / [(unit of x) * (unit of y)]
- in the correlation coefficient formula, the units cancel. The correlation
coefficient does not have any units; it's just a number.
Correlation is a standardized measure, which means it is not affected by
the units of the variables. This makes it easier to compare the relationships
between different pairs of variables.

NOTE:

Correlation does not imply causation; a positive or negative association between
two variables does not necessarily mean that a change in one of the variables is
causing the change in the other.
Example (Netflix): First, I rate a set of films. Netflix compares my
ratings with those of other customers to identify those whose ratings are highly
correlated with mine. Those customers tend to like the films that I like
(correlation - note that there is no causal relation here: those other customers do
not like those movies *because* I liked them). Once that is established, Netflix
can recommend films that like-minded customers have rated highly but that I
have not yet seen.

std vs standard error:

The standard deviation measures dispersion in the underlying population.
"std" does not require the distribution to be normal (i.e. to follow the
68-95-99.7% rule) - it can still indicate the dispersion of the data around the
mean.
The standard error (SE) measures the dispersion of the sample means (i.e.
how tightly we expect the sample means to cluster around the population
mean, given that we have taken multiple groups of samples from the population
and computed the mean of each group, as well as the mean of the entire
population).
Relationship between std & standard error: the standard error is the
standard deviation of the sample means.
SE = stdP / √nS
where
stdP = standard deviation of the population from which the sample is drawn
(for large samples, we can assume that the standard deviation of the sample is
reasonably close to the standard deviation of the population, regardless of what
the distribution of the underlying population looks like).
nS = size of the sample.
A large standard error means that the sample means are spread out
widely around the population mean; a small standard error means that they are
clustered relatively tightly.

When the standard deviation for the population is calculated from a smaller
sample, the formula is tweaked slightly: SE = std / √(n-1).

(figure: Normal distribution (bell curve) and its std percentages - the 68% - 95% - 99.7% rule.)

Pearson's Correlation Coefficient:

Pearson's Correlation Coefficient is a type of correlation coefficient that
measures linear association. It is denoted by r, and its value ranges from -1
to +1.
Formula:
r = Σ (xi - x̄)(yi - ȳ) / √[ Σ (xi - x̄)² * Σ (yi - ȳ)² ]
In pandas, covariance & correlation of data in a dataframe df can be computed
using df.cov() & df.corr() respectively.
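For example, a small sketch (the column names are made up):

import pandas as pd

df = pd.DataFrame({"height": [160, 170, 180, 190],
                   "weight": [55, 65, 80, 92]})
print(df.cov())   # sample covariance matrix (divides by n-1).
print(df.corr())  # Pearson correlation matrix (the default method).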

- Principal Component Analysis (PCA):

Principal Component Analysis is basically a statistical procedure to convert a set
of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables.
Each of the principal components is chosen in such a way that it would
describe most of the available information (variance) in the dataset, and all these
principal components are orthogonal (at right angles) to each other. In all
principal components, the first principal component has a maximum variance.
PCA is most commonly used when many of the variables are highly
correlated with each other and it is desirable to reduce their number to an
independent set. It is commonly used for dimensionality reduction: each data
point is projected onto only the first few principal components to obtain lower-dimensional
data while preserving as much of the data's variation as possible. It
can be shown that the principal components are eigenvectors (a vector which,
when operated on by a given operator, gives a scalar multiple of itself) of the
data's covariance matrix (a measure of how much two random variables change
together - see picture later). Thus, the principal components are often computed
by eigendecomposition of the data covariance matrix or by singular value
decomposition (SVD) of the data matrix.

Principal components are constructed in such a manner that the first principal
component accounts for the largest possible variance in the data set.
The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first principal
component and that it accounts for the next highest variance.
This continues until a total of p principal components have been
calculated, equal to the original number of variables.

Eigenvectors: The eigenvectors of a linear transform are those vectors
that remain pointed in the same direction under the transform. For these vectors,
the effect of the transform matrix is just scalar multiplication. For each
eigenvector, the eigenvalue is the scalar by which the vector is scaled under the
transform, i.e. the eigenvector is scaled according to the magnitude of the
transform applied, while its direction stays the same.

(figure: three vectors under a horizontal scaling - the vectors at 0 & 90 degrees (horizontal & vertical) are eigenvectors, whereas the one at 45 degrees is not.)
Eigenvectors are used in dimensionality reduction.
Ex: Given a set of variables that contain information from a dataset, can we use the
information stored in these variables to extract a smaller set of variables (features) to train a
model and do the prediction, while ensuring that most of the information contained in the original
variables is retained? This results in simpler and computationally efficient models.
This is where eigenvalues and eigenvectors come into the picture.
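A small NumPy sketch of the horizontal-scaling example from the figure (assuming a scale factor of 2 along x):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])       # horizontal scaling by 2.
vals, vecs = np.linalg.eig(A)
print(vals)   # eigenvalues: [2., 1.]
print(vecs)   # eigenvectors: the x- and y-axis directions (columns).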

Uses of PCA:
1) It is used to find interrelations between variables in the data i.e. identifying
patterns.
2) It is used to interpret and visualize data in a simpler way using
“Dimensionality Reduction” i.e. the number of variables is decreasing which
makes further analysis simpler.
3) Noise reduction: PCA can be used to reduce the noise in a dataset by
identifying and removing the principal components that correspond to the noisy
parts of the data.

Computing PCA:

1) Standardization: Standardize the range of the continuous initial


variables so that each one of them contributes equally to the analysis.
x = (value - mean) / std (same as normalization)

2) Covariance matrix: Compute the covariance matrix (to understand how the
variables of the input data set vary from the mean with respect to each
other, or in other words, to see whether there is any relationship between them).
The covariance matrix is a p × p symmetric matrix (where p is the number
of dimensions) whose entries are the covariances associated with all possible
pairs of the initial variables. For example, for a 3-dimensional data set with 3
variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)
(Covariance matrix for 3-dimensional data.)


Since the covariance of a variable with itself is its variance
(Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we actually have
the variances of each initial variable. And since the covariance is commutative
(Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with
respect to the main diagonal, which means that the upper and the lower
triangular portions are equal.

3) Eigenvalues & Eigenvectors: Compute the eigenvectors &
eigenvalues of the covariance matrix to identify the principal components.

Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables. These combinations are done in
such a way that the new variables (i.e., principal components) are uncorrelated
and most of the information within the initial variables is squeezed or
compressed into the first components - An important thing to realize here is that
the principal components are less interpretable and don’t have any real meaning
since they are constructed as linear combinations of the initial variables.
Geometrically speaking, principal components represent the directions of
the data that explain a maximal amount of variance, that is to say, the lines that
capture most information of the data. The relationship between variance and
information here, is that, the larger the variance carried by a line, the larger the
dispersion of the data points along it, and the larger the dispersion along a line,
the more information it has.
They always come in pairs, so that every eigenvector has an eigenvalue. And
their number is equal to the number of dimensions of the data. For example, for a
3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors
with 3 corresponding eigenvalues.
The eigenvectors of the Covariance matrix are actually the directions of
the axes where there is the most variance (most information) and that we call
Principal Components. Eigenvalues are simply the coefficients attached to
eigenvectors, which give the amount (magnitude) of variance carried in each
Principal Component. By ranking your eigenvectors in order of their eigenvalues,
highest to lowest, you get the principal components in order of significance.

Computation:
Solve the characteristic equation to find the eigenvalues of the covariance
matrix C. The characteristic equation is given by:
det(C − λI) = 0
where λ is an eigenvalue and I is the identity matrix (diagonals=1, rest=0).
Solve for λ to get the eigenvalues (det() is determinant of matrix - single value).

For a 3×3 matrix, the determinant (a single value) is computed by cofactor expansion:
det [ [a, b, c], [d, e, f], [g, h, i] ] = a(ei - fh) - b(di - fg) + c(dh - eg)

After obtaining the eigenvalues, you can find the corresponding eigenvectors by
solving the system of linear equations:
(C − λiI)vi = 0
Here, vi is the eigenvector corresponding to the ith eigenvalue.

Illustration:
Let's consider a 3x3 covariance matrix C for a dataset with three
variables X1, X2, X3 (this matrix is implied by the expansion below):

C = [  4   2  -1
       2   5   3
      -1   3   6 ]

(i) Eigenvalues: The characteristic equation is det(C − λI) = 0. So, solve for λ:

(4−λ)[(5−λ)(6−λ) − (3)(3)] − 2[(2)(6−λ) − (3)(−1)] + (−1)[(2)(3) − (−1)(5−λ)] = 0
→ (4−λ)[λ² − 11λ + 21] − 2[15 − 2λ] − [11 − λ] = 0
→ (−λ³ + 15λ² − 65λ + 84) − 30 + 4λ − 11 + λ = 0
→ −λ³ + 15λ² − 60λ + 43 = 0, i.e. λ³ − 15λ² + 60λ − 43 = 0
The number of solutions a polynomial equation can have depends on its
highest degree (3 here; the degree of a polynomial is the highest power of the
variable in the expression). So, there are 3 solutions (one per dimension/variable):
λi for i = 1, 2, 3.

(ii) Eigenvectors: For each eigenvalue λi, solve the system of equations
(C − λiI)vi = 0 to find the corresponding eigenvector vi.
[C]3x3 [v]3x1 = [0]3x1, where [v]3x1 is the eigenvector - note that 3 is the number of variables in the dataset.

If we rank the eigenvalues (λi) in descending order, the corresponding
eigenvectors (vi) give the principal components in order of significance.
After having the principal components, to compute the percentage of variance
(information) accounted for by each component, we divide the eigenvalue of
each component by the sum of all eigenvalues.
Ex: (λ1 / Σ_{i=1..N} λi) * 100 gives the percentage of variance carried by the principal component with eigenvalue λ1.
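In practice the whole procedure is usually done with a library; a minimal sketch using scikit-learn on made-up data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 samples, 3 variables.
X = (X - X.mean(axis=0)) / X.std(axis=0)    # step 1: standardize.

pca = PCA(n_components=2)                   # keep the first 2 principal components.
X_reduced = pca.fit_transform(X)            # steps 2-3 happen internally (via SVD).
print(pca.explained_variance_ratio_)        # fraction of variance per component.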

- Linear Interpolation:

Linear interpolation means we estimate a value using linear polynomials.
Suppose we have 2 points having values 10 and 20 and we want to estimate the
values in between them. Simple linear interpolation looks like this:

1D points:
Value = Value at Lower Bound + (Fractional Distance * Difference in Values)

Value at Lower Bound: the known value at the lower-bound point (the point with
the lower coordinate in 1D).

Fractional Distance: the distance between the target point and the lower-bound
point, divided by the total distance between the lower (start) & upper (end) points.
Difference in Values: the difference between the values at the upper and lower bounds.

In the above example:

Total distance {(0,0) to (0,3)} = (3 - 0) = 3.
Fractional distance of (0,1) = 1/3.
Difference in values = (20 - 10) = 10.
Value at (0,1) = 10 + ((1/3) * (20 - 10)) = 10 + (1/3 * 10) = 13.333

More weight is given to the nearest value (see the 1/3 and 2/3 in the figure above).
For 2D (e.g. images), we have to perform this operation twice, once along rows
and then along columns; that's why it is known as bilinear interpolation.
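The same 1D example with NumPy's built-in linear interpolation:

import numpy as np

# value at position 1 between the known points (0 -> 10) and (3 -> 20).
print(np.interp(1, [0, 3], [10, 20]))  # -> 13.333...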

(figure: a geometric visualization of bilinear interpolation - the product of the value at the desired point (black) and the entire area equals the sum of the products of the value at each corner and the partial area diagonally opposite that corner (corresponding colours).)

2D points: If the two known points are given by the coordinates (x0, y0)
and (x1, y1), the linear interpolant is the straight line between these points. For a
value x in the interval (x0, x1), the value y along the straight line is given by the
equation of slopes:
(y - y0) / (x - x0) = (y1 - y0) / (x1 - x0), i.e. y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
- Linear Regression vs Logistic Regression:

Linear Regression is used to handle regression problems, whereas Logistic
Regression is used to handle classification problems (or to model the
probability of a discrete outcome given an input variable). The most common
logistic regression models a binary outcome - something that can take two values
such as true/false, yes/no, and so on. Multinomial logistic regression can
model scenarios where there are more than two possible discrete outcomes.
Regression (including linear regression) provides a continuous output, but
logistic regression provides a discrete output.

- Activation Functions:

Sigmoid: When (multiple) output classes are not mutually exclusive
(a particular input can contain all, some or none of the output classes, i.e. the
output probabilities are independent of each other), use a sigmoid. The
sigmoid allows a high probability for all of your classes, some of
them, or none of them. Sigmoid activation is applied independently to each output.
In short: If your model’s output classes are NOT mutually exclusive and you
can choose many of them at the same time, use a sigmoid function on the
network’s raw outputs.
The main reason we use the sigmoid function is that its output lies between 0
and 1. It is therefore especially used for models where we have to predict a
probability as an output, and in situations where the sum of class/label
probabilities for an input need not be 1. It is also used for binary
classification (if the {0-1} value is greater than a threshold, then class1; else class2).
Sigmoid can also be used where outputs are continuous, whereas Softmax is
used where outputs are categorical.
If we want to have a classifier to solve a problem with more than one right
answer (i.e. outputs are NOT mutually exclusive), the Sigmoid Function is the
right choice, applied to each raw output independently.
Ex: input image might contain dog, cat, horse, or none of them.
Mathematical formula:
s(x) = 1 / (1 + e^(-x))

The function maps any input value to a value between 0 and 1. The
range of the function is (0,1), and the domain is (-infinity, +infinity).
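A one-line NumPy version of the formula:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # maps any real input into (0, 1).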

Softmax: Softmax is used to normalize the output of a network/model into a
probability distribution over the predicted (multiple) output classes. It is a
generalization of sigmoid. By definition, the softmax activation outputs one
value for each node in the output layer. The output values together
represent (or can be interpreted as) probabilities, and the probability values of all
the outputs sum to 1.0 (i.e. the probabilities are interrelated - at most 1 output class
can be true).
The difference between sigmoid & softmax is that Sigmoid is used for
binary classification methods where we only have 2 classes, while SoftMax
applies to multiclass problems. In fact, the SoftMax function is an extension of
the Sigmoid function.
Softmax takes a vector of real numbers as input and normalizes it into a
probability distribution proportional to the exponentials of the input numbers. After
applying softmax, each element is in the range 0 to 1, and the elements
add up to 1 (outputs ARE mutually exclusive).
Ex: an input image of a single digit can only be one of {0 to 9}.
Hence, unlike other activations, softmax is computed on top of an array of
values (i.e. taking into consideration all the output values).
Equation:
softmax(z)_i = e^(z_i) / Σ_{j=1..K} e^(z_j)

Code:

import numpy as np

def softmax(x):
    # x is the vector of raw outputs from the NN; np.sum(np.exp(x)) sums over all of them.
    return np.exp(x) / np.sum(np.exp(x))

It returns a vector, the same size as x, containing the probability for each
element in x.

Here, z represents the values from the neurons of the output layer. The exponential acts as
the nonlinear function. These values are then divided by the sum of the exponentials in order
to normalize them and convert them into probabilities. The index j runs over all the neurons of
the output layer (the entire length of the input vector z).
Tanh: tanh is also like the logistic sigmoid, but better. The range of the tanh
function is (-1, 1). tanh is also sigmoidal (s-shaped). The advantage is
that negative inputs are mapped strongly negative and zero inputs are mapped
near zero on the tanh graph. (Ex: it can be used for predicting bounding boxes
(say, relative to the object center - the start point will be a negative value)
in object detection, where predicted values can be negative too.)
Equation:
tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Properties of tanh:
(1) A negative input gives a negative output: tanh(−x) = −tanh(x) (it is an odd function).
(2) It is symmetric about the origin.
(3) Its domain is the set of all real values.
(4) Its range is (−1, 1), saturating towards ±1 for large |x|.

ReLU (Rectified Linear Unit): the ReLU is half rectified (from bottom). f(z)
is zero when z is less than zero and f(z) is equal to z when z is above or equal to
zero.
Equation: f(x) = max(0, x) returns the larger of (0, x).

Leaky ReLU: With a Leaky ReLU (LReLU), you won’t face the “dead ReLU”
(or “dying ReLU”) problem, which happens when a ReLU's input always stays
under 0 - this completely blocks learning through that ReLU because its gradient
is 0 in the negative part. So:
ReLU: the derivative of the ReLU is 1 in the positive part, and 0 in the
negative part.
LReLU: the derivative of the LReLU is 1 in the positive part, and a small
fraction in the negative part.
Now, think about the chain rule in the backward pass. If the derivative of
the slope of the ReLU is of 0, absolutely no learning is performed on the layers
below the dead ReLU, because 0 will be multiplied to the accumulated gradient
for the weight update. Thus, you can have dead neurons. This problem doesn’t
happen with LReLU or ELU for example, they will always have a little slope to
allow the gradients to flow on.
Equation:
f(x) = max(0.01*x , x)
This function returns x if it receives any positive input, but for any negative value
of x, it returns a really small value which is 0.01 times x. Thus it gives an output
for negative values as well.
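A quick PyTorch comparison of the two (note the small negative slope of LeakyReLU):

import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])
print(nn.ReLU()(x))            # tensor([0., 0., 3.])
print(nn.LeakyReLU(0.01)(x))   # tensor([-0.0200, 0.0000, 3.0000])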

- GELU: The GELU (Gaussian Error Linear Unit) activation function is a non-
linear activation function that weights the input by its probability under a
Gaussian/Normal distribution.
Mathematical Expression:
GELU(x) = (x/2) * (1 + erf(x/√2))   # where erf denotes the error function.
        ≈ 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x³)))   # tanh approximation.
The GELU function is based on the cumulative distribution function of a
Gaussian (normal) distribution. It smoothly approximates the ReLU function while
being differentiable everywhere.
GELUs are used in GPT-3, BERT, and most other Transformers. If you
combine the effect of ReLU, zoneout (maintaining the previous value - a method for
regularizing RNNs), and dropout, you get GELU.
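A sketch comparing the exact (erf-based) form with the tanh approximation given above:

import math
import torch

def gelu_exact(x):
    return 0.5 * x * (1 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):  # the tanh approximation.
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = torch.linspace(-3, 3, 7)
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh(x))))  # very small difference.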

Activation functions like ReLU, ELU and PReLU have enabled faster and better
convergence of Neural Networks than sigmoids.
Also, Dropout regularizes the model by randomly multiplying a few activations by
0.
Both of the above methods together decide a neuron’s output. Yet, the two work
independently from each other. GELU aims to combine them.
NOTE: Zero centered activation functions ensure that the mean
activation value is around zero, hence can easily map the output values as
strongly negative, neutral, or strongly positive. Ex: Tanh.
Sigmoid, ReLU, GELU are not zero centered functions.

Properties of GELU:
Smoothness: GELU is smooth and continuous, which makes it suitable for
gradient-based optimization algorithms like backpropagation.
Range: GELU outputs values in approximately [−0.17, ∞); it is unbounded
above, but unlike ReLU it can output small negative values around 0.
Saturation: GELU does not suffer from the vanishing gradient problem as
much as sigmoid, especially in deep networks.

Differences Compared to Other Activation Functions:

Compared to ReLU: GELU is smooth and approximately zero-centered
(i.e. its expected value is close to zero when applied to a large number of inputs),
addressing the "dying ReLU" problem (output is 0 when input <= 0) where
neurons can become inactive during training; GELU has a non-zero gradient
around input = 0, which allows the network to keep learning in this region. It is
often preferred for deeper networks.
Compared to Sigmoid: GELU provides a wider range of non-linearity and
typically converges faster during training.
Compared to Tanh: GELU is similar to tanh in terms of smoothness
and saturation characteristics but has a different shape and can outperform tanh
in certain scenarios.

- Logits: The logit function, also known as the log-odds function, maps
probability values in (0, 1) to values from negative infinity to infinity. The
function is the inverse of the sigmoid function, which limits values to
between 0 and 1. Because the logit function's domain is 0 to 1, the function is
most commonly used for working with probabilities.
The logit function is represented as:
logit(x) = log(x / (1 − x))
If x represents a probability, then x/(1-x) is the odds, and the logit function is
the logarithm of the odds. The function is defined on the domain of 0 to 1, and
produces real numbers ranging from negative infinity to infinity.

The logit function is used alongside the sigmoid function in neural
networks: the sigmoid (activation) function produces a probability, whereas
the logit function takes a probability and produces a real number between
negative and positive infinity. Like the sigmoid function, logit functions are
often associated with the last layer of a neural network, as they can simplify the data.
For example, in a neural network used for classification tasks, as the network
determines probabilities for classification, the logit function can transform those
probabilities into real numbers.

NOTE: In DL, "logits" usually refers to the raw NN outputs - unnormalized
scores, before being fed to an activation function such as sigmoid or softmax.
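A small check that logit is the inverse of sigmoid (torch.logit is available in recent PyTorch versions):

import torch

p = torch.tensor([0.25, 0.5, 0.75])
z = torch.logit(p)        # log(p / (1 - p)).
print(torch.sigmoid(z))   # recovers [0.25, 0.5, 0.75].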

NOTE:
Probability vs Odds of an outcome: The probability that an event will
occur is the fraction of times you expect to see that event in many trials.
Probabilities always range between 0 and 1.
The odds of an outcome are the ratio of the probability that the
outcome occurs to the probability that the outcome does not occur i.e. { p /
(1 - p) }.
Probabilities between 0 and 0.5 equal odds less than 1.0. A probability of
0.5 is the same as odds of 1.0. Think of it this way: The probability of flipping a
coin to heads is 50%. The odds are "fifty-fifty," which equals 1.0. As the
probability goes up from 0.5 to 1.0, the odds increase from 1.0 to approach
infinity. For example, if the probability is 0.75, then the odds are 75:25, three to
one, or 3.0.
If the odds are high (million to one), the probability is almost 1.0. If the
odds are tiny (one to a million), the probability is tiny, almost zero.

Probability vs. odds:


1) Probability typically appears as a percentage, while odds can be expressed
as a fraction or ratio.
2) Probability uses a range that only exists between the numbers zero and
one, while odds have no upper limit.
3) Calculating probability considers all potential outcomes of an event, while
calculating odds compares the number of desired outcomes against the
number of possible unwanted outcomes.

Log-odds of an outcome: obtained by taking the natural logarithm (log to
the base e) of the odds, i.e. ln(odds).

(figure: log-odds plotted over the value range -5 to 5.)

NOTE: The width (number of neurons in a layer) of a NN increases its capacity
(representational power). The greater the capacity, the more variability
in the inputs the model will be able to manage; but at the same time, the more
likely overfitting will be, since the model can use a greater number of parameters
to memorize unessential aspects of the input.
Depth (number of layers) allows a model to deal with hierarchical information
(increased complexity) when we need to understand the context in order to say
something about some input.
For ex: in computer vision, a shallower network could identify a
person's shape in a photo, whereas a deeper network could identify the person,
the face on their top half, and the mouth within the face.
Adding depth to a model generally makes training harder to converge (due
to the vanishing/exploding gradients problem, later mitigated by ResNet
architectures via skip (residual) connections).

- log-probabilities: a log-probability is simply the logarithm of a
probability. Instead of representing probabilities on the standard unit interval
(between 0 and 1), log-probabilities represent them on a logarithmic scale.
There are several reasons why log-probabilities are useful:
1. Numerical stability: Computers can be limited when dealing with very
small numbers. Multiplying several small probabilities can lead to underflow,
where the result is rounded down to zero. Logarithms convert multiplication
into addition, which is a more stable operation for computers. This is
especially important in deep learning, where many small probabilities are often
involved.
2. Easier manipulation: Logarithms turn multiplication into addition, which
is often easier to work with mathematically. For example, the probability of two
independent events occurring together is the product of their individual
probabilities. In log space, this becomes the sum of their individual log-
probabilities.

Converting from log-probabilities back to regular probabilities is done by taking
the exponential of the log-probability.

- A loss function (or cost function) is a function that computes a single
numerical value that the learning process will attempt to minimize.
Conceptually, a loss function is a way of prioritizing which errors to fix from our
training samples, so that our parameter updates result in adjustments to the
outputs for the highly weighted samples instead of changes to some other
sample's output that had a smaller loss.
For example: the squared difference, (trueValue - predictedValue)^2,
penalizes wildly wrong results more than the absolute difference,
|trueValue - predictedValue|, does. Often, having more slightly wrong results is
better than having a few wildly wrong ones, and the squared difference helps
prioritize those as desired.
In pytorch, we can get the scalar loss value of a loss object for
plotting in a graph.
Ex:
lossFn = nn.MSELoss()
loss = lossFn(predictions, labels)
loss.backward()
lossValue = loss.item() # get the scalar loss value.
lossValuesList.append(lossValue) # plot lossValuesList in graph later.

Another common loss type is nn.CrossEntropyLoss(), which is used for discrete
multi-class classification. CrossEntropyLoss combines nn.LogSoftmax() and
nn.NLLLoss() (negative log likelihood loss - useful for training a classification
problem with C classes) in one single class, hence there is no need to apply a
softmax activation separately.
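A minimal sketch of its expected shapes (random logits, integer class-index targets):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10, requires_grad=True)  # raw scores: batch of 4, 10 classes.
targets = torch.tensor([3, 0, 9, 1])             # class indices, one per sample.
loss = criterion(logits, targets)                # log-softmax + NLL, all in one.
loss.backward()
print(loss.item())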

Equations for some common Losses:

(a) Binary Cross Entropy Loss (BCE or Log Loss):

Binary cross entropy is the negative average of the log of the
corrected predicted probabilities.
Binary cross entropy compares each of the predicted probabilities to the
actual class output, which can be either 0 or 1. It then calculates a score that
penalizes the probabilities based on their distance from the expected value, i.e.
how close or far they are from the actual value.

Corrected probability is the probability that a particular observation belongs to its original class.
For example, say observation ID6 originally belongs to class 1 with predicted probability 0.94;
its corrected probability is the same, 0.94. On the other hand, say observation ID8 is from
class 0, and the predicted probability (the chance that ID8 belongs to class 1) is 0.56;
the corrected probability (the chance that ID8 belongs to class 0) is (1 - predicted_probability) = 0.44.

The reason for using the log value is that the log offers a smaller penalty for
small differences between the predicted probability and the corrected probability;
when the difference is large, the penalty is higher.
Since all the corrected probabilities lie between 0 and 1, all the log values are
negative. In order to compensate for this negative value, we use the negative
average of the values:

BCE = -(1/N) * Σ_{i=1..N} log(pi)

where pi is the corrected probability and N is the number of observations.

Another form is:

BCE = -(1/N) * Σ_{i=1..N} [ yi * log(pi) + (1 - yi) * log(1 - pi) ]

Here, pi is the predicted probability of class 1, and (1 - pi) is the probability of class 0. When the
observation belongs to class 1 (yi = 1), the first part of the formula becomes active and the
second part vanishes, and vice versa when the observation's actual class is 0. This is how we
calculate binary cross-entropy.
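A small NumPy sketch using the ID6/ID8 numbers from above:

import numpy as np

y_true = np.array([1, 0])        # actual classes (ID6, ID8).
p_pred = np.array([0.94, 0.56])  # predicted probability of class 1.

# corrected probability: p if the actual class is 1, (1 - p) if it is 0.
p_corrected = np.where(y_true == 1, p_pred, 1 - p_pred)  # -> [0.94, 0.44]
bce = -np.mean(np.log(p_corrected))
print(bce)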

(b) Cross Entropy Loss (also called logarithmic loss, log loss or logistic loss):
The purpose of cross-entropy is to take the output
probabilities (P) and measure their distance from the truth values.

For example, with the desired output [1,0,0,0] for the class dog, the model
might output [0.775, 0.116, 0.039, 0.070]. The objective is to make the model
output as close as possible to the desired output (truth values).

CE = -Σ_i ti * log(pi)

where ti represents the true label for class 'i' (1 if the true class is 'i', 0 otherwise)
and pi is the predicted probability for class 'i'.

Each predicted class probability is compared to the actual desired class
output (0 or 1), and a score/loss is calculated that penalizes the probability
based on how far it is from the actual expected value.
The penalty is logarithmic in nature, yielding a large score for large differences
close to 1 and a small score for small differences tending to 0.

Ex: if the GT is 1 and the predicted probability is 0.8, then CE_Loss = -(1 *
ln(0.8)) = -(1 * -0.223) = 0.223 (since 0.8 is close to the GT 1, the loss is small: 0.223).
If the GT is 1 and the predicted probability is 0.2, then CE_Loss = -(1 * ln(0.2)) = -(1 *
-1.609) = 1.609 (since 0.2 is far from the GT 1, the loss is larger: 1.609).

The reason for using a negative sign is the same as in BCE loss.

Note on CrossEntropyLoss() in PyTorch: The performance of this
criterion is generally better when the TARGET (i.e. labels/ground truth) contains
class indices (an integer 1D tensor) or a one-hot encoding (a 2D tensor),
as this allows for optimized computation. Consider providing the target as class
probabilities only when a single class label per mini-batch item is too restrictive.
So labels can contain class indices (a 1D tensor of values 0 to (C-1) for C
classes),
BUT
PREDICTIONS must contain a score/probability for each class (a 2D tensor, not a
1D tensor of class indices), so that learning can occur. Using class indices
instead of per-class scores for the predictions doesn't work with
CrossEntropyLoss, because functions such as torch.argmax() used to obtain
class indices are not differentiable (no gradients), hence learning cannot be
performed.

(c) Mean Squared Error Loss (MSE):

Mean squared error (MSE) loss is calculated by averaging the squared
difference between the true value y (GT) and the predicted value ŷ:

MSE = (1/N) * Σ_{i=1..N} (yi - ŷi)²

where N is the number of observations/data points. MSE ranges from 0 to infinity.

(d) Mean Absolute Error Loss (MAE):

Similar to MSE, but using the absolute difference between the GT & predicted
value:

MAE = (1/N) * Σ_{i=1..N} |yi - ŷi|

This treats the error of every data point equally (no squaring), hence it is more
robust to outliers (data points with huge errors).
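Both are one-liners in PyTorch (nn.L1Loss is the MAE):

import torch
import torch.nn as nn

preds = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])
print(nn.MSELoss()(preds, target))  # mean of squared differences.
print(nn.L1Loss()(preds, target))   # mean of absolute differences (MAE).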

- Metrics:

Evaluation metrics are used to assess the performance of a trained machine
learning model on a dataset. Unlike loss functions, which are used during
training, evaluation metrics are used after training to measure how well the model
generalizes to new, unseen data.
Evaluation metrics provide a way to quantitatively measure the quality of a
model's predictions in a way that is meaningful for a specific problem. They are
used to compare different models, select the best model, and communicate the
model's performance to stakeholders.

Loss functions are used during the training phase, whereas metrics are used
during the validation & testing phases. It's important to choose appropriate loss
functions and evaluation metrics that align with the goals of your machine
learning task.
Commonly used metrics: accuracy, confusion matrix, log-loss (CrossEntropy
loss), AUC-ROC (Area Under Curve - Receiver Operator Characteristics), etc.

A confusion matrix (a matrix used to determine the performance of a
classification model on a given set of test data) displays the number of true
positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)
produced by the model on the test data. It depicts, for example, the number of
times a rare object class occurred and the number of times the model predicted
that class correctly.
A typical (binary) confusion matrix looks as follows:

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN

Example Code:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_names = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
cm = confusion_matrix(labels, predictions)  # labels = ground truth, predictions = model outputs.
ConfusionMatrixDisplay(confusion_matrix=cm,
                       display_labels=class_names).plot()  # confusion matrix visualization.

(figure: sample confusion matrix with numbers.)

- Backpropagation:
This is a reverse step of forward pass: We start with the
loss value obtained in feedforward propagation and update the weights “w” (in y
= w*x + b, where y = output, x = input, b = bias) of the network in such a way that
the loss value is minimized as much as possible, in following way:
(1) Compute the (original) loss value obtained in feedforward propagation
(with original weights w0).
(2) Change each weight within the neural network by a small amount – one at
a time & compute the new loss.
(3) Measure the change in loss (δL i.e. difference in loss before & after weight
change) when the weight value is changed (δW) i.e. gradient (δL / δW)
(gradients can be positive or negative, thus will result in increase or decrease of
weights during updation).
(4) Update the (original) weight (& bias) (Gradient Descent) by ( -⍺ * (δL / δW)
) (where ⍺ (alpha) is a positive value and is a hyperparameter known as the learning
rate).
Formula: w0 = w0 - ( ⍺ * (δL / δW) )
“w” is the parameter, e.g., a weight in a neural network, and “L” is the objective, e.g., the loss function. Gradient descent moves “w” in the direction that minimizes the loss. The direction is provided by the differentiation (δL / δW), but how much to move is controlled by the learning rate ⍺.

Note that the amount of update made to a particular weight is proportional to the
amount of loss that is reduced by changing it by a small amount.
Intuitively, if changing a weight reduces the loss by a large value, then we
can update the weight by a large amount. However, if the loss reduction is
small by changing the weight, then we update it only by a small amount.
i.e.
(1) compute loss (L_1) with current weights
(2) modify weights (add by a constant C) & recompute new loss (L_2)
(3) compute gradient = (L_2 - L_1) / C.
(4) update weights according to gradient:
updated_w -= gradient * learning_rate.
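A minimal numeric sketch of the above finite-difference update, for a single weight in y = w*x (all names & values illustrative):

def loss(w, x, y_true):
    return ((w * x) - y_true) ** 2 # squared error for one sample.

w, x, y_true = 1.0, 2.0, 8.0
learning_rate, C = 0.01, 1e-4 # C is the small constant added to the weight.

L_1 = loss(w, x, y_true) # (1) loss with current weight.
L_2 = loss(w + C, x, y_true) # (2) new loss after modifying the weight.
gradient = (L_2 - L_1) / C # (3) numerical gradient.
w -= gradient * learning_rate # (4) update weight according to gradient.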

Chain rule: Computing grads for each weight separately for a huge
network can be computationally expensive. Hence we can use chain rule in
backpropagation:
Performing a chain of differentiations to fetch the differentiation of our
interest i.e. computing gradient of weights at a layer, & then using that gradient to
compute gradient of weight in the previous layer, thus avoiding recomputations of
gradients.

- Data Shift: Data shift is a common problem that afflicts ML models, in which the distribution of the database used for testing the performance of the learning models or systems (i.e. the test set) may differ from the distribution of
the training data. This may occur when the data acquisition conditions or the
systems that are used for collecting the test data change from when the training
dataset was acquired; & could induce
(a) a covariate shift (a shift in the distribution in the covariates).
(b) a prior probability shift, that is, a difference in the distribution of the target
variable.
(c) a domain shift, that is, a change in measurement systems or methods.

- Data Leakage: Data leakage is a major problem in ML, in which data outside of the training set seeps into the model while building the model. This
event could lead to an error-prone or invalid ML model.

- Bootstrapping: This method works by randomly sampling observations from a database (possibly with augmentations) to form a training set whose size
is equal to the original database. As a result, some of the observations can
appear several times in the training set, while some may never be selected.
The latter observations are called “out-of-bag” and are used to test the
learning algorithm. This process is repeated multiple times to estimate the
learning model’s generalization performance.
Although bootstrapping tends to drastically reduce the variance, it often tends to
provide more biased results, more importantly when dealing with small sample
sizes.
Bootstrapping for Semi-Supervised Learning: In scenarios where labeled
data is limited, bootstrapping techniques are used to generate pseudo-labels for
unlabeled data. This process involves training a model on the available labeled
data, using it to predict labels for the unlabeled data, and then incorporating
these predictions as pseudo-labels for further training.

Bootstrapping for Reinforcement Learning: In reinforcement learning,
bootstrapping techniques involve estimating future rewards or values by
iteratively updating and refining these estimates based on current observations
and actions.

- Ensemble Learning:
Ensemble learning involves combining multiple individual models to
produce a stronger, more robust model that typically performs better than any
single constituent model.
Bucket of models: A "bucket of models" is an ensemble technique in
which a model selection algorithm is used to choose the best model for each
problem. When tested with only one problem, a bucket of models can produce no
better results than the best model in the set, but when evaluated across many
problems, it will typically produce much better results, on average, than any
model in the set.
Gating: It involves training another model to decide which of the
models in the bucket (i.e. list of models) is best suited to solve the problem.
Often, a perceptron is used for the gating model. It can be used to pick the "best"
model, or it can be used to give a linear weight to the predictions from each
model in the bucket.

Ensemble learning in deep learning is used to reduce overfitting, improve generalization, and enhance the overall performance of models by leveraging
diverse predictions and capturing different aspects of the data.
However, building and training ensemble models can be computationally
expensive due to the need to train multiple models, but they often provide better
predictive performance.

- Learning Rate Scheduler:

A scheduler makes the learning rate adaptive to the gradient descent optimization procedure, so you can increase performance and reduce training
time.
You can update the learning rate as frequently as each step but usually it is
updated once per epoch because you want to know how the network performs in
order to determine how the learning rate should update. Regularly, a model is
evaluated with a validation dataset once per epoch.

There are multiple ways of making learning rate adaptive. At the beginning of
training, you may prefer a larger learning rate so you improve the network
coarsely to speed up the progress.

Different Learning Rate Schedulers. Note that schedulers like Cyclic & Cosine
have the ability to again increase the LR, after it has been decreased.

In a very complex neural network model, you may also prefer to gradually
increase the learning rate at the beginning because you need the network to
explore the different dimensions of prediction. At the end of training, however,
you always want to have the learning rate smaller. Since at that time, you are
about to get the best performance from the model and it is easy to overshoot if
the learning rate is large.
Therefore, the simplest and perhaps most used adaptation of the learning
rate during training is a technique that reduces the learning rate over time.
There are many learning rate schedulers provided by PyTorch in the
torch.optim.lr_scheduler submodule. All the schedulers need the optimizer to
update as the first argument. Depending on the scheduler, you may need to
provide more arguments to set up one.

Note that even though you initially set a learning rate in the optimizer, the
scheduler ultimately determines the learning rate used during optimization. This
allows you to implement various learning rate decay schedules and fine-tune the
training process.

<scheduler>.get_last_lr(): returns the last computed learning rate (in a list) by the current scheduler. This can be used for logging i.e. in TensorBoard.

Different LRS:

- One Cycle LR:

In a one-cycle learning rate policy, the learning rate is adjusted over the course
of training in a cyclical pattern. The policy typically consists of two phases:

Warm-up Phase (Increasing Learning Rate): At the beginning of training, the learning rate is initially set to a small value, and it gradually increases in a linear or geometric manner over a certain number of iterations or epochs. This helps
the model to converge faster during the initial phase of training.

Annealing Phase (Decreasing Learning Rate): After the warm-up phase, the
learning rate is gradually decreased. This phase may also be implemented in a
linear or geometric manner. During this phase, the learning rate becomes smaller
and allows the model to fine-tune and generalize better.

OneCycleLR() anneals the learning rate from an initial learning rate to some
maximum learning rate (max_lr) and then from that maximum learning rate to
some minimum learning rate much lower than the initial learning rate.

In PyTorch, you can implement a one-cycle learning rate policy using a learning
rate scheduler such as torch.optim.lr_scheduler.OneCycleLR(<optimizer>,
max_lr=, total_steps=None, epochs=None, steps_per_epoch=None,
pct_start=0.3, anneal_strategy='cos', cycle_momentum=True,
base_momentum=0.85, max_momentum=0.95, div_factor=25.0,
final_div_factor=10000.0, …).

Note also that the total number of steps (iterations) in the cycle can be
determined in one of two ways (listed in order of precedence):
A value for total_steps is explicitly provided.
A number of epochs (epochs) and a number of steps per epoch
(steps_per_epoch) are provided. In this case, the number of total steps
(iterations) is inferred by total_steps = epochs * steps_per_epoch.
You must either provide a value for total_steps or provide a value for both
epochs and steps_per_epoch.
(steps_per_epoch = len(<train-set dataloader>))

The one-cycle learning rate policy changes the learning rate after every batch: scheduler.step() should be called after a batch has been used for training (i.e. after each iteration).

“pct_start” is the percentage of the cycle (in number of steps) spent increasing
the learning rate (default = 0.3).
“div_factor” determines the initial learning rate via initial_lr = max_lr/div_factor
(default = 25).
“final_div_factor” determines the minimum learning rate via min_lr =
initial_lr/final_div_factor (default = 1e4).
To be exact, the learning rate will increase from an initial lr value = (max_lr / div_factor) to max_lr in the first (pct_start * total_steps) steps (iterations), and then decrease smoothly to a (final) minimum lr value (initial_lr / final_div_factor).
Ex Code:
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01,
    steps_per_epoch=len(data_loader), epochs=10)
for epoch in range(10):
    for batch in data_loader:
        train_batch(...)
        optimizer.step()
        scheduler.step() # called every step (iteration).

- Lambda LR:

Sets the learning rate of each parameter group (of optimizer) to the initial lr times
a given function (lr_lambda). When last_epoch=-1, set initial lr as lr.

lr_epoch = lr_initial * Lambda(epoch)

Code:
import matplotlib.pyplot as plt

lambda1 = lambda epoch: 0.65 ** epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)
lrs = []
for i in range(10):
    # <some training code>
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"]) # for plotting.
    scheduler.step()

plt.plot(range(10), lrs)

Lambda function in python:
Syntax: lambda arguments : expression
The expression is executed and the result is returned:
Ex:
x = lambda a : a + 10
print(x(5))

Output:
15

Also see Multiplicative LR - formula: lr_epoch = lr_epoch-1 * Lambda(epoch)
Multiply the (previous) learning rate of each parameter group by the factor given in the specified function. When last_epoch=-1, set initial lr as lr.

- Linear LR:

import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler

lr = 0.1 # original learning rate.
# assuming a model & optimizer are already defined, e.g.:
# optimizer = optim.SGD(model.parameters(), lr=lr)
scheduler = lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5,
    total_iters=30)
In the above, LinearLR() is used. It is a linear rate scheduler and it takes
three additional parameters, the start_factor, end_factor, and total_iters. You set
start_factor to 1.0, end_factor to 0.5, and total_iters to 30, therefore it will make a
multiplicative factor decrease from 1.0 to 0.5, in 30 equal steps. After 30 steps,
the factor will stay at 0.5. This factor is then multiplied to the original learning rate
at the optimizer. Hence you will see the learning rate (0.1 initially) decreased
from (0.1 * 1.0 = 0.1) to (0.1 * 0.5 = 0.05).
Output (training epochs = 50):
# lr scheduler automatically calculates how much to (equally) reduce, so that by the end of 30
epochs, lr is 0.05.
Epoch 0: SGD lr 0.1000 -> 0.0983
Epoch 1: SGD lr 0.0983 -> 0.0967
Epoch 2: SGD lr 0.0967 -> 0.0950
Epoch 3: SGD lr 0.0950 -> 0.0933
Epoch 4: SGD lr 0.0933 -> 0.0917
...
Epoch 28: SGD lr 0.0533 -> 0.0517
Epoch 29: SGD lr 0.0517 -> 0.0500
Epoch 30: SGD lr 0.0500 -> 0.0500 # After epoch=30, lr remains constant.
Epoch 31: SGD lr 0.0500 -> 0.0500
...
Epoch 48: SGD lr 0.0500 -> 0.0500
Epoch 49: SGD lr 0.0500 -> 0.0500

- Exponential LR:

scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

Output (initial lr = 0.1):


Epoch 0: SGD lr 0.1000 -> 0.0990 # 0.1(current lr) * 0.99(gamma) = 0.099
Epoch 1: SGD lr 0.0990 -> 0.0980 # 0.099 * 0.99(gamma) = 0.09801
Epoch 2: SGD lr 0.0980 -> 0.0970 # 0.09801 * 0.99(gamma) = 0.09702
Epoch 3: SGD lr 0.0970 -> 0.0961 # 0.09702 * 0.99 = 0.09604
Epoch 4: SGD lr 0.0961 -> 0.0951
...
Epoch 45: SGD lr 0.0636 -> 0.0630
Epoch 46: SGD lr 0.0630 -> 0.0624
Epoch 47: SGD lr 0.0624 -> 0.0617
Epoch 48: SGD lr 0.0617 -> 0.0611
Epoch 49: SGD lr 0.0611 -> 0.0605
The learning rate is updated by multiplying with a constant factor gamma in
each scheduler update. The gamma value should be less than 1 to reduce the
learning rate.

- Cyclic LR (triangular2):

Sets the learning rate of each parameter group according to cyclical learning rate
policy (CLR). The policy cycles the learning rate between two boundaries with a
constant frequency, as detailed in the paper Cyclical Learning Rates for Training
Neural Networks. The distance between the two boundaries can be scaled on a
per-iteration or per-cycle basis.
The learning rate values change in a cycle, from lower to higher and vice versa. This method helps the model get out of a local minimum or a saddle point while not skipping the global minimum.

The general algorithm for CyclicLR is the following:
- Set the minimum learning rate
- Set the maximum learning rate
- Let the learning rate fluctuate between the two thresholds in cycles

Code:
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001,
    max_lr=0.1, step_size_up=5, mode="triangular2")
lrs = []

for i in range(100):
    optimizer.step() # optimizer step.
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step() # update learning rate - done per step (or batch).

plt.plot(lrs)

Code:
import torch

model = torch.nn.Linear(2, 2) # illustrative model.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01,
    weight_decay=0.01, amsgrad=False)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001,
    max_lr=0.1, step_size_up=2000, step_size_down=None, mode='triangular',
    gamma=1.0, cycle_momentum=False) # AdamW has no classical momentum to cycle.

for epoch in range(20):
    for input, target in dataset: # dataset & loss_fn assumed defined.
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step() # optimizer step.
    scheduler.step() # update learning rate - done per epoch.

NOTE: If there are extremely large batches then scheduler.step() can be called after a few batches. If the number of batches is smaller, then do it every few epochs. Never do it per batch, as that is too frequent.

Parameters:

base_lr: The initial learning rate, which is the lower boundary of the
cycle.

max_lr: The maximum learning rate, which is the higher boundary of the cycle.
The cycle amplitude is defined as (max_lr - base_lr). The learning rate at
any cycle is the sum of base_lr and some amplitude scaling. Therefore, max_lr
may not even be reached in some cases, depending on the scaling function.
The step size reflects in how many iterations the learning rate will go from one bound to the other.

step_size_up: The number of training iterations (not epochs) passed when increasing the learning rate from Base LR to Max LR. A smaller value means that LR increases quickly (in fewer iterations) from Base LR to Max LR.

step_size_down: The number of training iterations passed when decreasing the learning rate from Max LR to Base LR.
A greater value means that it takes more iterations to reach from Max LR to
Base LR i.e. rate of decreasing of LR is slower.
If step_size_down is set to None, its value is set to that of step_size_up.

mode: There are different techniques in which the learning rate can vary
between the two boundaries:
- ‘triangular’: In this method, we start training at the base learning
rate and then increase it until the maximum learning rate is reached. After
that, we decrease the learning rate back to the base value. Increasing and
decreasing the learning rate from min to max and back take half a cycle
each.

- ‘triangular2’: In this method, the maximal learning rate
threshold is cut in half every cycle. Thus, you can avoid getting stuck in the
local minima/saddle points while decreasing the learning rate.

- ‘exp_range’: Like ‘triangular2’, this method allows you to decrease the learning rate, but more gradually, following an exponential decay.

gamma: The constant variable in the ‘exp_range’ scaling function - a
multiplicative factor by which the learning rate is decayed. For instance, if the
learning rate is 1000 and gamma is 0.5, the new learning rate will be 1000 x 0.5
= 500.
The gamma value should be less than 1 to reduce the learning rate.

scale_mode: Defines whether the scaling function is evaluated on cycle number or cycle iterations (training iterations since the start of the cycle):
- ‘cycle’ (multiple of step_size_up/down)
- ‘iterations’ (enables to choose for scaling within a cycle - for more frequent
scaling)

base_momentum: Lower momentum boundary in the cycle for each parameter group.
Note that momentum is cycled inversely to the learning rate. At the cycle’s
peak, momentum is ‘base_momentum,’ and the learning rate is ‘max_lr’.
Momentum can smooth the learning algorithm’s progression and, in effect, will accelerate the training cycle i.e. the model achieves good accuracy in fewer cycles.

max_momentum: Upper momentum boundary in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum -
base_momentum). The momentum at any cycle is the difference between
max_momentum and some scaling of the amplitude; therefore, base_momentum
may not actually be reached depending on the scaling function.
Note that momentum is cycled inversely to learning rate; at the start of a cycle,
momentum is ‘max_momentum’, and learning rate is ‘base_lr’.

- CosineAnnealingWarmRestarts:

CosineAnnealingLR continuously reduces (& increases) the learning rate in a cosine-shaped manner, potentially without any resets.

Notice the gradual increase in LR, after LR reaches floor (least) value.

CosineAnnealingLR is useful for finding good minima in the loss landscape and can help improve the generalization of the model by exploring different regions of the parameter space.

CosineAnnealingWarmRestarts periodically resets the learning rate to its initial value, allowing it to explore different parts of the loss landscape in a cyclical fashion.

Notice the instant ‘reset’ of LR to original value, after LR reaches floor value.

CosineAnnealingWarmRestarts sets the learning rate of each parameter group using a cosine annealing schedule, and restarts after Ti epochs.

The CosineAnnealingWarmRestarts Scheduler requires some extra steps to
function properly.
torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0,
T_mult=1, eta_min=0, last_epoch=-1, verbose=False)
Set the learning rate of each parameter group using a cosine annealing schedule:

η_t = η_min + (1/2) (η_max - η_min) (1 + cos((T_cur / T_i) π))

where η_max is set to the initial lr, T_cur is the number of epochs since the last restart and T_i is the number of epochs between two warm restarts in SGDR.
When T_cur = T_i, set η_t = η_min. When T_cur = 0 after a restart, set η_t = η_max.

T_0 (int): Number of iterations for the first restart.
T_mult (int, optional): A factor that increases Ti after a restart. Default: 1.
eta_min (float, optional): Minimum learning rate. Default: 0.
last_epoch (int, optional): The index of the last epoch. Default: -1.
verbose (bool): If True, prints a message to stdout for each update. Default =
False.

Ex: If T_0 = 3, T_mult = 1 and eta_min = 0.0001 & initial learning rate =
0.001; the scheduler will start with an initial learning rate of 0.001 and
reduce it to 0.0001 in every 3 epochs. Then, it'll start again with a learning rate of
0.001 and decrease it to 0.0001 in 3 epochs.

In order to properly change Learning Rate for longer training, you should
ideally pass the epoch number while invoking the step() function. Like so:
Code:
optimizer = ...
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, ...)
iters = len(dataloader)

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        # Forward Pass
        # Compute Loss and Perform Back-propagation
        # Update Optimizer
        optimizer.step()
        scheduler.step(epoch + i / iters) # pass the (fractional) epoch number.

NOTE: If passing the epoch number is not done, this usually leads to a
more erratic change of learning rate, and not gradual and smooth as expected.

- AutoGrad: PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can
automatically provide the chain of derivatives of such operations with respect to
their inputs (useful in backpropagation). This means we won’t need to derive our
model by hand; given a forward expression, no matter how nested, PyTorch will
automatically provide the gradient of that expression with respect to its input
parameters.
This can be enabled on tensors by passing requires_grad=True arg in tensor
construction.
Example: params = torch.tensor([1.0, 0.0], requires_grad=True)
Any tensor that will have ‘params’ as an ancestor will have access to the chain of
functions that were called to get from ‘params’ to that tensor. In case these
functions are differentiable (and most PyTorch tensor operations will be), the
value of the derivative will be automatically populated as a grad attribute of the
params tensor.
In general, all PyTorch tensors have an attribute named grad. Normally, it’s
None.
To disable the grad of given parameters, get the parameters from
<object>.parameters() & set requires_grad to False:
for param in <entity>.parameters():
    param.requires_grad = False
If <entity> is a tensor, its grad can be accessed/set as <entity>.requires_grad. If
it is an object such as nn.Linear(...), use <entity>.parameters().
The above can also be achieved in context via torch.no_grad(), as:
with torch.no_grad():
<do something within this loop. Grads won't be updated here.>
To disable gradients in an entire function, use decorator @torch.no_grad() at top
of that function.
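A minimal sketch of autograd in action (values illustrative): a differentiable expression built from params fills params.grad after backward(), and the update is done inside torch.no_grad() so it is not tracked:

import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (params ** 2).sum() # any differentiable expression.
loss.backward() # populates params.grad with dLoss/dparams.
print(params.grad) # tensor([2., 0.])

with torch.no_grad(): # ops here are not tracked by autograd.
    params -= 0.1 * params.grad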
Using None as an indexing operator in tensor, is equivalent to
adding an extra dimension at that index,
Ex:
t1 = torch.rand(3, 100, 100) # shape=[3, 100, 100].
t1[None] is similar to performing t1.unsqueeze(0) (along 0th dimension). #
shape = [1, 3, 100, 100].
t1[:, None] is similar to performing t1.unsqueeze(1) (along 1st dimension). #
shape = [3, 1, 100, 100].
t1[..., None] will insert a dimension at the last index.
t1[..., None, :] will insert one before the last dimension.

- Optimizers: Optimizers are functions that control how the weights &
learning rate are changed, to decrease the associated error i.e. optimize
accuracy. Torch has an ‘optim’ module that provides a list of optimizers:
import torch.optim as optim
Every optimizer constructor takes a list of (model) parameters (aka
PyTorch tensors, typically with requires_grad set to True) as the first input. All
parameters passed to the optimizer are retained inside the optimizer object so
the optimizer can update their values and access their grad attribute.
Each optimizer exposes two methods: zero_grad() and step().
zero_grad() zeroes the grad attribute of all the parameters passed to the
optimizer upon construction, to clear out the (previous) gradients of all
parameters that the optimizer is tracking, as the gradients are accumulated (This
is useful when we want to accumulate gradients across multiple batches, but it
can lead to incorrect gradient computations when we only want to compute the
gradients for a single batch).
step() updates the value of those parameters according to the optimization
strategy implemented by the specific optimizer.
Ex: Adam, SGD, RMSProp, etc.
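A minimal sketch of the zero_grad()/step() cycle described above (all values illustrative):

import torch

params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.1)

for _ in range(100):
    optimizer.zero_grad() # clear previously accumulated gradients.
    loss = (params - 3.0).pow(2).sum()
    loss.backward() # compute fresh gradients.
    optimizer.step() # update params using their grad attribute.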

Adam:
- Adam (Adaptive Moment Estimation) dynamically adjusts the learning rate
for each parameter during training (vanilla SGD uses a fixed LR). It computes a separate
adaptive learning rate for each parameter, which can be particularly useful when
dealing with sparse or noisy gradients.
- Adam incorporates momentum-like behavior through the first-order
moment (mean) of the gradients. It helps smoothen the optimization process and
escape local minima i.e. convergence.

- It includes a bias correction mechanism to counteract the initialization bias
in the first few iterations, especially when the moving averages are small at the
beginning of training (SGD does not have this).
- Adam often converges faster and requires less tuning of hyperparameters
like learning rate, compared to vanilla SGD. However, the choice of optimizer
may depend on the specific problem and dataset.

- Overfitting: When the model performs well on training data (loss is less),
but loss is comparatively much more (than acceptable difference) in validation
data, the model is said to overfit. It means the model is not able to generalize
well.
If both training & validation loss is high, it means the model is not able to learn at
all (underfitting).
Reducing overfitting involves a tradeoff: the model must be expressive enough to fit the data, yet still generalize well. Overfitting can occur when the model has a much higher number of parameters/neurons than necessary. Hence, one way of reducing overfitting is to start training with a model that has a smaller number of neurons, then keep increasing the neurons as long as the model does not overfit, while still maintaining generalization i.e. increase the parameter count until the model fits, and then scale it down until it stops overfitting.
Another way to reduce overfitting is adding penalization terms to the loss
function, to make it cheaper for the model to behave more smoothly and change
more slowly.

- OverGeneralization: The issue of being very confident about samples that are far from the training distribution is called overgeneralization.
Ex: a model that is trained to detect birds from planes, is given an input image of
a cat, it will still give its prediction about how bird/plane-like the cat is.

- Training, Validation & Test set:

Training set: A set of examples used for learning, that is to fit the
parameters of the model.

Validation set: A set of examples used to tune the hyper-parameters of
a model (manually, or via automated frameworks), for example to choose the
number of hidden units in a neural network.
The evaluation becomes more biased as skill on the validation dataset is
incorporated into the model configuration.

Test set: A set of examples used only to assess the performance of a fully-specified model. No information from the test set leaks into the model’s
knowledge.

Peeking is a consequence of using test-set performance to both choose a hypothesis and evaluate it (tune the model’s hyperparameters - in simple terms, the
validation set is used to optimize the model parameters (if we are doing it, say
during k fold cross validation) while the test set is used to provide an unbiased
estimate of the final model). The way to avoid this is to really hold the test set out
—lock it away until you are completely done with learning and simply wish to
obtain an independent evaluation of the final hypothesis (And then, if you don’t
like the results… you have to obtain, and lock away, a completely new test set if
you want to go back and find a better hypothesis) i.e. the test set does not
contribute to learning anything, by the model.
If the validation set is not used to learn (hyperparameters) by the model, the
validation set can be considered to be the test set.
If the test set is locked away, but you still want to measure performance on
unseen data as a way of selecting a good hypothesis (peeking or information
leak), then divide the available data (without the test set) into a training set and a
validation set.

- Factors affecting NN performance:

Following points can also be used during training, to analyze & improve
model performance:

(1) Dataset size & variety.

(2) Batch size & Shuffling: Usually, a smaller batch size means
better model performance, as more weight updates are performed (batch size of
8-16 is considered as good). Shuffling batches means weights can be avoided
from getting stuck in local minima.
The larger the batch size, more is the time taken to converge and more
iterations required to attain a high accuracy.
(3) Scaled inputs/Normalisation: Scaling all inputs/outputs in a
common range of 0 to 1, or -1 to 1, increases the possibility of the model being
able to better fit the input data.
(4) Learning Rate: Usually, a small learning rate results in more
stable learning, as well as the model being able to fit training data. Alternatively,
learning rate can be annealed (LR scheduling) by being large at beginning of
training, & being lowered, when validation losses do not decrease. Can also use
a learning rate scheduler.
(5) Loss Function selection.
(6) Optimizer selection.

Overfitting related:

(1) Batch Normalisation: Huge weight values (especially in deep hidden layers) lead to less responsive behaviour in activation functions, impacting model learning abilities. Batch normalization layers can help rectify this.
(2) Dropouts.
(3) Regularization (adding a penalty term to loss function ex: L1(lasso
regression) & L2(ridge regression)).
(4) NN width (neurons in a layer) & depth (number of layers).

- Tips for Performance Improvement during Training:

1) Scenario:
(a) Train a model for E epochs. At the end, loss settles at a small value, say L.
(b) When retrain the model for much larger epochs (say 3*E or 4*E times or
more), loss decreases slowly, and finally settles at value similar to L, as in (a).
Tip:
The model is probably stuck in a local optimum. Try a larger learning rate if there is little change in loss over several iterations - use a scheduler that periodically starts from a larger LR, brings it down gradually, and then starts again from a bigger one.

TENSORS:

- Tensors are multidimensional arrays. Tensors can be created using torch.tensor().
Ex:
torch.tensor([2.5, 1.5]) # 1D tensor.
torch.tensor([ [2, 1], [4, 9] ]) # 2D tensor.

- tensor.shape gives the size of the tensor. tensor.size() also returns the shape of the tensor. tensor.size(<int dim>) returns the size of the tensor along that dimension.

- Indexing: Just like NumPy, we can use range indexing (0 based) for each of the tensor’s dimensions.

- Named tensors: we can give names to dimensions when creating tensors, or modify/delete (by passing None) names of existing tensors.
Ex:
Img1 = torch.zeros(3, 5, 5, names=('channels', 'rows', 'cols')) # create with names.
Img2 = torch.randn(3, 5, 5).refine_names(..., 'chnl', 'rows', 'cols') # add names to an unnamed tensor.

PyTorch does not automatically align dimensions, so we need to do this explicitly. The method align_as returns a tensor with missing dimensions added and existing ones permuted to the right order:
# align Img3’s data as per Img1’s names, into Img4.
Img4 = Img3.align_as(Img1)
Img5 = Img3.rename(None) # remove all names (unnamed).

- dtype: The default data type for tensors is 32-bit floating-point. PyTorch supports many other data types, like torch.float32|64|16, torch.int16|32|64, torch.uint8, etc.
To change tensor type, use <tensor variable>.to(<type>):
Ex:
t1 = torch.zeros((2,3), dtype=torch.float64) # type=double.
t2 = t1.to(torch.float32) # change type from double to float.

- Placing tensor on CPU/GPU:
In addition to dtype, a PyTorch Tensor also has the notion of device, which is where on the computer the tensor data is placed. Here is how we can create a tensor on the GPU by specifying the corresponding argument to the constructor:
points_gpu = torch.tensor([[4.0, 1.0], [5.0, 3.0], [2.0, 1.0]], device='cuda')
points_gpu = points.to(device='cuda') # copy from cpu to gpu.
If our machine has more than one GPU, we can also decide on which GPU
we allocate the tensor by passing a zero-based integer identifying the GPU on
the machine:
points_gpu = points.to(device='cuda:0')
Alternatively:
points_gpu = points.cuda() # create on GPU
points_gpu = points.cuda(0) # Defaults to GPU index 0 (if multiple GPUs)
points_cpu = points_gpu.cpu() # copy from GPU to CPU.

- PyTorch tensors can be converted to NumPy arrays and vice versa very
efficiently using tensor.numpy() & torch.from_numpy().
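A small sketch of the shared-memory behaviour (CPU tensors only; values illustrative):

import numpy as np
import torch

t = torch.ones(2, 3)
arr = t.numpy() # shares memory with t (CPU tensors only).
t2 = torch.from_numpy(arr) # also shares the same memory.
arr[0, 0] = 99.0 # the change is visible in t and t2 as well.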

- Serialize tensor: <tensor> = torch.load(<filename>) & torch.save(<tensor object>, <file name>) load/save to a file on disk (this can serialize an entire model, not just model weights, using Python’s pickle).
While loading, the model class should already be defined beforehand.

The disadvantage of this approach is that the serialized data is bound to
the specific classes and the exact directory structure used when the model is
saved. The reason for this is because pickle does not save the model class itself.
Rather, it saves a path to the file containing the class, which is used during load
time. Because of this, your code can break in various ways when used in other
projects or after refactors.
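A minimal sketch of both approaches (filenames illustrative, and model assumed to be an existing nn.Module): pickling the whole model versus saving only its state_dict, which avoids the class/directory-structure problem described above:

import torch

torch.save(model, 'full_model.pth') # pickles the entire model object.
model2 = torch.load('full_model.pth') # the model class must be importable at load time.

# Alternative that avoids the pickling caveat: save only the weights.
torch.save(model.state_dict(), 'weights.pth')
model.load_state_dict(torch.load('weights.pth'))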

- Exporting pytorch model to ONNX:
To convert a pytorch (.pt / .pth) model to (.onnx) format, use torch.onnx.export() (use import torch.onnx).
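A minimal export sketch, assuming model is a trained torch.nn.Module and the input shape below is illustrative:

import torch
import torch.onnx

model.eval() # switch to inference mode before exporting.
dummy_input = torch.randn(1, 3, 224, 224) # example input with the expected shape.
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])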

- Dataset Augmentation (albumentations):

Albumentations is a Python library for fast and flexible image augmentations. Albumentations efficiently implements a rich variety of image transform operations that are optimized for performance, and does so while providing a concise, yet powerful image augmentation interface for different computer vision tasks, including object classification, segmentation, and detection.

Installation:

pip:
pip install -U albumentations

conda:
conda install -c conda-forge imgaug
conda install -c conda-forge albumentations

Code:

(1) import albumentations as A # import Albumentations.

# Albumentations uses the most common and popular RGB image format, so use cv2.cvtColor()
to convert BGR to RGB, where applicable.
# Albumentations works with numpy arrays.

(2)
(a) Define an augmentation pipeline:
transform = A.Compose([
A.RandomCrop(width=256, height=256),
A.HorizontalFlip(p=0.5),
A.RandomBrightnessContrast(p=0.2),
])
# Each augmentation will change the input image with the probability set by the parameter p.
Also, many augmentations have parameters that control the magnitude of changes that will be
applied to an image.

(b) Usage:
transformed = transform(image=image)
# transform will return a dictionary with a single key image. Value at that key will contain an
augmented image.
transformed_image = transformed["image"]
OR
transformed_image = transform(image=image)["image"]

Keypoints (affine-translation) augmentation (link:
https://albumentations.ai/docs/getting_started/keypoints_augmentation/)
Example:
translate_dict = {"x":(-0.38, 0.3), "y":(-0.13, 0.34)}
aff_transforms = A.Compose([A.Affine(translate_percent=translate_dict,
keep_ratio=True)], keypoint_params=A.KeypointParams(format='xy',
remove_invisible=False))
transformed = aff_transforms(image=img, keypoints=kps) # keypoint (kps) values
should be in image dimensions (i.e. image width, height), not normalized (i.e. 0-1).
transformed_image = transformed['image']
transformed_keypoints = transformed['keypoints']

There are various operations supported in Albumentations, such as blur, crop, noise, ColorJitter, equalize, flip, brightness, contrast, Affine (rotate, translate, scaling/zoom, shear), etc.

SCRIPTING:

Fig. 1 Brackets in tensor, & their dimension numbering.

USEFUL FUNCTIONS:

- torch functions that end with an underscore i.e. “_”, perform operations in-
place i.e. they modify the data of the same provided variable tensor.

Note that these in-place functions are methods on the torch.Tensor object, not
attached to the torch module like many other functions (e.g., torch.sin()).

- tensor.shape: returns the size. tensor.size(): also returns the size.

- Accessing/Modifying Tensors using indices:
Ex:
t1 = torch.zeros((2, 3), dtype=float) # create a tensor with all element values as 0.
t1[0][2] = 45 # assign/modify value at index. Indexes are 0 based.
var1 = t1[0][2] # access value at index i.e. 1st row & 3rd column.
var2 = t1[0,2] # same as above.
var3 = t1[0, :] # select 1st row & all columns (i.e. all columns for 1st row).
var3 = t1[:, 2] # select all rows for the 3rd column.
# Ex: accessing alternate elements.
var[::2] *= 2 # multiply even indices (0,2,4,...) by 2.
var[1::2] *= 2 # multiply odd indices (1,3,5,...) by 2.

- tensor.float(): converts the tensor elements' dtype to float.
Ex:
t1 = torch.tensor([1, 2, 3]) # dtype=torch.int64.
t2 = t1.float() # dtype=torch.float32.
- tensor.view(*shape): This is similar to numpy.reshape(), & is useful to change
the shape of a tensor, with data being the same.
Calling view() on a tensor returns a new tensor (header) that changes the
number of dimensions and the striding information, without changing the storage.
Ex:
a = torch.arange(1., 17.) # a contains 16 elements (1D).
a = a.view(4, 4) # a contains 16 elements (4 rows, 4 cols).

If we are not sure of one field (say columns), but know how many of the other fields (say rows) we want, we can specify the former field as -1 and PyTorch will automatically infer it. Note that at most one dimension can be -1, even when the tensor has more than 2 dimensions.
Ex:
a = a.view(2,-1) # a contains 16 elements (2 rows, 8 cols)

view(-1) operation flattens the tensor, if it wasn’t already flattened.
Ex:
a = torch.ones(2,3,4)
print(a.shape) # torch.Size([2, 3, 4])
b = a.view(-1)
print(b.shape) # torch.Size([24])

- torch.clone(): copies a tensor into new memory, creating a new variable.

- torch.ones(*size): Returns a tensor filled with the scalar value 1, with the
shape defined by the variable argument size. Size can be a sequence of
integers, or a list or tuple.
Ex:
a = torch.ones(2, 2)
b = torch.ones([2, 2]) # both a & b have the same shape i.e. (2, 2).

- Arithmetic operations:
Ex:
y = x * 10 # multiply elements of tensor x by 10. Same as torch.mul().
y = x.add(20) # add 20 to each element in x.
z = torch.div(x, y) # divide each element of x by the corresponding element of y.
z = torch.sub(x, y) # subtract y from x.
Also see [ torch.matmul(x, y) OR x@y ] for matrix operations.

In PyTorch, when you use the * operator to multiply two 2D tensors A and
B, it performs element-wise multiplication (Hadamard product) rather than matrix
multiplication or dot product.
- Element-wise Multiplication: Using the * operator, as in A * B, will
multiply corresponding elements of the two tensors element-wise. This means
that the result will have the same shape as the input tensors (both inputs should
have the same shape), and each element in the result will be the product of the
corresponding elements in A and B.
- Dot Product: Dot product is the sum of products of values in two same-
sized vectors (scalar output, if inputs are 1D arrays).

If you want to compute the dot product between two vectors, you should use
torch.dot().
Unlike NumPy’s dot, torch.dot() intentionally only supports computing the dot
product of two 1D tensors with the same number of elements. To calculate the
dot product, you can perform the element-wise product and then sum the results
(sum of all elements in the resulting matrix) i.e. C = (A*B).sum().
- Matrix Multiplication: Matrix multiplication is basically a matrix version of the dot product (2nd & 3rd example in the above image).
If you want to perform matrix multiplication between two tensors, you should use
the torch.mm() function (if the input tensors are not 2D, it will raise an error. It
does not support broadcasting) or the @ operator; or torch.matmul() (matmul() is
more versatile and can handle a wider range of tensor shapes and operations. It
performs matrix multiplication for 2D tensors like torch.mm(), but it can also
handle higher-dimensional tensors and supports broadcasting for tensors of
different shapes).
Inputs of matrix multiplication should have shape: (m x n) & (n x p) - output
will have shape (m x p).
The result of torch.dot() is a scalar. The result of matrix multiplication
is a matrix, whose elements are the dot products of pairs of vectors in each
matrix.
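A short sketch contrasting the three operations discussed above (values illustrative):

import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])

elem = A * B # element-wise (Hadamard) product, shape (2, 2).
mm = A @ B # matrix multiplication (same as torch.matmul(A, B)), shape (2, 2).

v1 = torch.tensor([1., 2., 3.])
v2 = torch.tensor([4., 5., 6.])
dot = torch.dot(v1, v2) # 1*4 + 2*5 + 3*6 = 32 (scalar).
same = (v1 * v2).sum() # element-wise product + sum gives the same value.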

- torch.cat((input1, input2, …), dim=): concatenates tensors along the specified dimension. Also see torch.split().
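For example (shapes illustrative):

import torch

a = torch.ones(2, 3)
b = torch.zeros(2, 3)
c = torch.cat((a, b), dim=0) # shape (4, 3): stacked along rows.
d = torch.cat((a, b), dim=1) # shape (2, 6): stacked along columns.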

- torch.argmax(): returns the indices of the maximum value of all elements in the input tensor, along the specified dimension.
This is the second value returned by torch.max().

- torch.Tensor.eq(<value>): performs element-wise equality comparison with <value>. Can be used with sum(), to get the number of occurrences of <value> in a tensor.
Other similar functions are le(), ge(), etc.
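For example, counting occurrences of a value (values illustrative):

import torch

t = torch.tensor([1, 0, 1, 1])
count = t.eq(1).sum().item() # 3 occurrences of the value 1.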

- numpy.delete(<ndarray>, obj=, axis=): used to delete a particular section along a dimension/axis (say a row (axis=0) or column (axis=1)), indicated by the 0-based index “obj”
i.e. {obj=1, axis=0} will delete the 2nd row, while {obj=1, axis=1} will delete the 2nd column from the input (2D) numpy array.

- torch.Tensor.item(): gets the element from a tensor containing a single element.
Ex:
t1 = torch.tensor([0.5]) # t1 contains a single element ([0.5]).
element = t1.item() # element = 0.5

- torch.randint(low=, high=, size=): generates a tensor of the given size, filled with random values between low (inclusive) & high (exclusive).
torch.rand(*size=,...): generate a tensor of size, with random
numbers.
torch.randn(*size=,...): Returns a tensor filled with random numbers from
a normal distribution with mean 0 and variance 1 (also called the standard
normal distribution).

- torch.from_numpy(<numpy variable>): converts from numpy to tensor. The returned tensor and ndarray share the same memory.

- torch.squeeze(<input>, <axis index>): opposite of unsqueeze(); removes the axis at the provided index - the dimension at that index must have size 1.

- torch.permute(<input>, *dims): permutes/rearranges the dimensions of the input tensor, based on the provided indices.

- torch.cuda.is_available(): returns True if GPU processing is available. If True, can set the device to ‘cuda’, else ‘cpu’, in tensor.to(device).

- torch.unique(<input tensor>): Returns the unique elements of the input tensor. Similar to torch.Tensor.unique().

- torch.nn.functional.one_hot(tensor, num_classes=-1): Takes a LongTensor with index values of shape (*) and returns a tensor of shape (*, num_classes) that has zeros everywhere except where the index of the last dimension matches the corresponding value of the input tensor, in which case it will be 1.

Ex:
tens1 = torch.Tensor([0,1,0]) # ‘0’ & ‘1’ both represent different classes.
result = torch.nn.functional.one_hot(tens1.to(torch.long), 2)
# result = [[1,0], [0,1], [1, 0]]

- np.std(<err>): Calculates the standard deviation of a (list of) errors “err” (or values), which is calculated using the following formula:
std = np.sqrt(np.mean((err - np.mean(err))**2))
‘std’ indicates the variability in the errors.
A low standard deviation means that most of the numbers are close to the
mean (average) value.
Low std means errors are mostly of the same type (high or low, depending on
whether mean/MAE is high or low, respectively) i.e. errors cluster closely (close
positive & negative range) around the mean/MAE value.
A high standard deviation means that the values are spread out over a
wider range. High std means errors contain both (comparatively) lower & higher
errors; when taking MAE as the point of comparison i.e. high variability in both
ways - positive as well as negative range.
So, STD, when taken into consideration with other associated parameters, like
mean or MAE, indicate the nature of the overall errors Ex: if average MAE is low,
& STD is also low (low variance), we can assume that all individual errors are
quite low. On the other hand, if the average MAE is low, but STD is high, it could
mean some individual errors are still high.
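A small sketch of this interpretation (values illustrative): two error lists with the same mean but different spreads:

import numpy as np

err1 = np.array([0.9, 1.0, 1.1]) # errors cluster near the mean.
err2 = np.array([0.1, 1.0, 1.9]) # same mean, but spread out.

print(np.mean(err1), np.std(err1)) # 1.0, ~0.08 - low variability.
print(np.mean(err2), np.std(err2)) # 1.0, ~0.73 - high variability.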

- timeit:
“timeit” provides a method of measuring the execution time of your Python
code snippets.
Ex Code:
import timeit

n = 1_000_000 # value used by the timed snippet below (illustrative).

def test(n):
    return sum(range(n))

result = timeit.timeit('test(n)', globals=globals(), number=1) # execute test(n) 1 time.
When you pass the code you wish to measure to timeit.timeit() as a string, it
executes the code number times and returns the execution time.
The default value for “number” is 1,000,000. Be aware that running time-
consuming code with the default value can take significant time.
The code is executed in the global namespace by providing globals() as the
globals argument. Without this, the function test and the variable n from the
example would not be recognized.
timeit.timeit() can also accept a callable object. You can specify lambda
expression with no arguments, in which case the globals argument is not
necessary.
loop = 100 # number of executions (illustrative).
result = timeit.timeit(lambda: test(n), number=loop)
print(result / loop) # average time per execution.

timeit.timeit() returns the time (in seconds) it took to execute the code “number”
times.

- time: The elapsed time (i.e. time taken for a section of code to execute)
can be estimated using the “time” module:
Ex Code:
import time

# Start timer
start_time = time.time() # returns a float value (current start time in seconds).

# Code to be timed
# ...

# End timer
end_time = time.time()

# Calculate elapsed time


elapsed_time = end_time - start_time # divide by 3600, to get in hours.
print("Elapsed time: ", elapsed_time)

- torch.cuda.stream: This API allows for parallel execution of operations (not
threads) simultaneously on GPU, maximizing usage of available GPU resources.
Allows you to manage and schedule operations on a GPU. It represents an
independent sequence of CUDA operations that can be executed concurrently
with other streams, enabling asynchronous execution of operations on the GPU -
enabling overlap of computation and communication.
Primary functions:
(1) Asynchronous Execution: Streams allow you to perform computations
concurrently on the GPU. You can schedule different operations (kernels,
memory copies, etc.) within different streams, enabling overlap of computation
and communication.
(2) Explicit Control: With streams, you have explicit control over the
execution order of operations on the GPU. This can help in overlapping
computation with data transfers or overlapping different computation tasks,
ultimately improving overall performance.
(3) Resource Management: Streams provide a way to manage
resources on the GPU. They allow you to allocate memory and execute
operations within specific contexts, avoiding contention between different parts of
your code that might be using the GPU simultaneously.
(4) Synchronization: Streams can be synchronized explicitly, ensuring that
operations within a particular stream complete before moving on to the
subsequent ones. This synchronization can be useful in scenarios where the
order of execution matters or when dependencies exist between different
computations.
Example Code:
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
# Initialise cuda tensors here. E.g.:
A = torch.rand(1000, 1000, device = 'cuda')
B = torch.rand(1000, 1000, device = 'cuda')
# Wait for the above tensors to initialise.
torch.cuda.synchronize()
with torch.cuda.stream(s1):
    C = torch.mm(A, A)
with torch.cuda.stream(s2):
    D = torch.mm(B, B)
# Wait for C and D to be computed.
torch.cuda.synchronize()
# Do stuff with C and D.

- torch.cuda.synchronize(): Waits for all kernels in all streams on a CUDA device to complete.

NOTE: For parallelizing different functional bodies that operate independently (say, agents), but need to use GPU resources simultaneously, try using torch.cuda.streams inside and along with python threading.
Note that in Python, different threads do not actually execute at the same
time: they merely appear to. The threads may be running on different
processors, but they will only be running one at a time. For actual
multiprocessing, can spawn processes rather than threads, using python’s
multiprocessing module.
Threads Example code:
import time
import threading

def f1():
    time.sleep(6)
    print("f1() complete.")

def f2():
    time.sleep(3)
    print("f2() complete.")

def start():
    # Create threads for f1 and f2.
    thread1 = threading.Thread(target=f1)
    thread2 = threading.Thread(target=f2)

    # Start the threads.
    thread1.start()
    thread2.start()

    # Wait for both threads to complete.
    thread1.join()
    thread2.join()

    print("Execution completed.")

start()
Output:
f2() complete.
f1() complete.
Execution completed.

- DICOM (Digital Imaging and Communications in Medicine) files:
DICOM image/video files can be handled using python’s pydicom library. DICOM files can contain other data apart from the images/video, such as patient name & info, etc.
Install pydicom using pip:
pip install pydicom

Usage Ex:
import pydicom as dicom
dicomfile = dicom.dcmread(<file name>)
video = dicomfile.pixel_array # extracts image/video data from dicom file in
compatible form - requires NumPy.

- Saving video file to storage via OpenCV:

Code:
import cv2

def save_video(video, kp, filepath, fps=30):
    # opencv style. video should be a numpy array with shape: [F,H,W,C].
    # In general, colour images are expected in BGR format.
    # codec=("libx264"), (*'MJPG')
    width = video.shape[2]
    height = video.shape[1]
    result = cv2.VideoWriter(filepath, cv2.VideoWriter_fourcc(*'MJPG'), fps,
                             (width, height))
    for frame_no, frame in enumerate(video):
        # optional - display results (18 keypoints per frame here (in “kp”), for illustration).
        for i in range(18):
            x = int(kp[frame_no, i, 0])
            y = int(kp[frame_no, i, 1])
            cv2.circle(frame, (x, y), 2, (0, 255, 0), 3)
        result.write(frame)
        # cv2.imshow("w1", frame) # optional - display frame.
        # cv2.waitKey()
    result.release()
    return

- Create, Load model with pretrained weights & Inference:
Creating a model instance (object) from its class, loading pretrained model weights, & making a forward pass (the input will run through the first set of neurons, whose outputs will be fed to the next set of neurons, all the way to the final output) from an input image, to get output labels.
In the models module, the uppercase names correspond to classes that
implement popular architectures for computer vision. The lowercase names, on
the other hand, are convenient functions that instantiate models with predefined
numbers of layers and units and optionally download and load pre-trained
weights into them.

Some of the resources like bobby.jpg, imagenet_classes.txt can be found at the GitHub repo:
https://github.com/deep-learning-with-pytorch/dlwpt-code/find/master

# import commonly used models.
from torchvision import models

# create a residual network instance with 101 layers & pretrained weights.
resnet = models.resnet101(weights='IMAGENET1K_V1')

# print resnet info i.e. its architecture.
resnet

#The resnet variable can be called like a function, taking as input one or more images and
producing an equal number of scores for each of the 1,000 ImageNet classes. Before we can
do that, however, we have to preprocess the input images so they are the right size and so that
their values (colors) sit roughly in the same numerical range. In order to do that, the torchvision
module provides transforms.
from torchvision import transforms

# define a preprocess function that will scale the input image to 256 × 256, crop the image to
224 × 224 around the center, transform it(PIL image) to a tensor, and normalise its RGB
components so that they have defined means and standard deviations. These need to match
what was presented to the network during training, if we want the network to produce
meaningful answers.
# transforms.Normalize(mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5])]) - This will
normalize the image to have a mean of 0 & standard deviation of 1 { formula: value =
(image pixel value - mean) / std } - Note that this does not mean that the output values will
be in the range 0-1. ToTensor() automatically scales values to range (0,1) - if the PIL Image
belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the
numpy.ndarray has dtype = np.uint8, so to convert the values to a range (-1, 1), use 0.5
instead of 127.5 in mean & std.
transforms.ToTensor() also automatically rearranges the ordering of the dimensions of a
cv2 image(initially [H x W x C]), so that the output image has dimensions [C x H x W]. Input
image should have a shape of length 3, even if it is a grayscale image i.e. [H,W,1] - sending
an image of shape [H,W] to transform.resize does not work.
preprocess = transforms.Compose([transforms.Resize(256),
transforms.CenterCrop(224), transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])]) #
this will bring the mean to 0 (note that range won't necessarily be in 0 to 1, but somewhat near it
ex: around -2 to 2, as happens for most images).

# load a test image so that we can preprocess it & feed to our network.
from PIL import Image
# import image - path is relative to where the current jupyter notebook file is saved.
img = Image.open("images/bobby.jpg")
img # show image in notebook.

# preprocess this image to convert it into the size, crop, etc used for training.
img_t = preprocess(img)

# unsqueeze (Returns a new tensor with a dimension of size one inserted at the specified
position without changing it’s contents; just uses an extra index to access it’s elements), for
input to the neural network. Insert the resultant 1D tensor at dimension dim (0 below i.e. a row)
in the resultant variable (batch_t).
import torch
batch_t = torch.unsqueeze(img_t, 0)

# Run the model - for inference, switch to eval mode.
resnet.eval()
# make a forward-pass: give input (image) to resnet instance, to get output (outputs scores).
out = resnet(batch_t)

# See the list of predicted labels(in file ‘imagenet_classes.txt’), by loading a text file listing the
labels in the same order they were presented to the network during training.
with open ('imagenet_classes.txt') as f:
labels = [line.strip() for line in f.readlines()]

# Determine the index corresponding to the maximum score in the out tensor. ‘index’ is of the
form tensor([x]).
# torch.max() Returns a named tuple (values, indices) where values(_ below) is the maximum
value of each row of the input tensor in the given dimension dim(1 below); and indices is the
index location of each maximum value found (argmax).
_, index = torch.max(out, 1) # see Fig. 1
# Convert scores into percentage (*100) using softmax(which gives probabilities); along
dimension dim. Dim is the dimension in tensor, to be selected as input.
percent = torch.nn.functional.softmax(out, dim=1)[0]*100

# index is a tensor, so need to get the actual value by referencing the first element as index[0].
Output the label & it’s confidence/percentage.
labels[index[0]] , percent[index[0]].item()

# can also sort the output using sort() that also provides the indices of the sorted values in the
original array, so that we can get listing of top ‘N’ scores.
_, indices = torch.sort(out, descending=True)

# get the 2nd, 3rd best prediction & so on. ‘Indices’ is of the form tensor([ [x, y, z,...] ]). In
indices[0][1], 1st index ([0]) refers to the outermost brackets of the tensor([ [x,y,z,...] ]), 2nd
index [1]) to next inner brackets.
labels[indices[0][1]] , percent[indices[0][1]].item()
labels[indices[0][2]] , percent[indices[0][2]].item()

VSCode via Anaconda:

Terminal inside VSCode can also be used to start your own virtual
environment (python -m venv), install required packages inside this venv, etc.

Same code in Visual Studio Code launched in Anaconda:

NOTE: Install the Python extension in VSCode, if not installed. Also, always launch VSCode from the Anaconda “Home” tab.

import torch
from torchvision import models
from torchvision import transforms
from PIL import Image

#print("\nStart\n")

resNvar = models.resnet101(weights="IMAGENET1K_V1")
preprocess = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224),
transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229,
0.224, 0.225])])

# "images" folder is in same location as this source code file.


inImg = Image.open('images/bobby.jpg')
inImgBatch = preprocess(inImg) # preprocess returns a torch.Tensor, not a PIL Image.
#inImg.show()

in_batch_t = torch.unsqueeze(inImgBatch, 0)

resNvar.eval()
output = resNvar(in_batch_t)
#output = tensor([[...]]), so get the 0th dimension.
percentages = torch.nn.functional.softmax(output, dim=1)[0]

# classes txt file is in same location as this (test.py) file.


with open ('imagenet_classes.txt') as f:
labels = [line.strip() for line in f.readlines()]

value, index = torch.max(percentages, 0)

print(labels[index]) # index = tensor(<number>)


print(value.item()) # value = tensor(<number>, grad_fn=<MaxBackward0>)
print("\n")

- Load pretrained models online (git) from Torch Hub:


# TorchHub is a mechanism through which authors can publish a model on GitHub, with or
without pretrained weights, and expose it through an interface that PyTorch understands. This
makes loading a pretrained model from a third party as easy as loading a TorchVision model.
from torch import hub

# Load the pretrained model architecture resnet18 from "https://github.com/pytorch/vision"
# ("main" branch), with pretrained weights.
resnet_18_model = hub.load('pytorch/vision:main', 'resnet18', pretrained=True)

- Real World Data Representation using Tensors:

IMAGES:

- A handy way of manipulating images of different formats with a uniform
API is using the imageio module:
import imageio

imgVar = imageio.imread('<file path & name>') # load image into a numPy array.
imgVar.shape # get size. shape is an alias for size.
imgTrchV = torch.from_numpy(imgVar) # convert to torch tensor.

# imgVar has format (HxWxC). Use permute() to rearrange dimensions. Dimensions are 0
# based. Original = (0,1,2). For arranging in (CxHxW), the new format would be (2,0,1).
# permute() does not create a new copy, but produces a changed header with a new
# ordering of the dimensions.
out = imgTrchV.permute(2,0,1) # out tensor format is CHW.

NOTE: imgVar is a numPy array like object.


Any library that outputs a NumPy array will suffice to obtain a PyTorch
tensor, using torch.from_numpy(). The only thing to watch out for is the layout of
the dimensions. PyTorch modules dealing with image data require tensors to be
laid out as C × H × W: channels, height, and width, respectively (a 3-dimensional
tensor). When using multiple images i.e. in batches (denoted by N as the batch
size), the dimensions are [N x C x H x W].
To create a dataset of multiple images to use as an input for our neural
networks, we store the images in a batch along the first dimension to obtain an N
× C × H × W tensor (a 4-dimensional tensor), where N is the batch size.
For videos (with multiple frames), the format is [N x F x C x H x W],
where F is the number of frames in the supplied video, & N is the batch number
(for sending multiple videos as a batch).
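
A minimal sketch of building such a batch (sizes below are illustrative, not from the
text): pre-allocate an N x C x H x W tensor, or stack a list of equally sized images.
import torch

batch = torch.zeros(3, 3, 256, 256, dtype=torch.uint8) # N=3 images of 3 x 256 x 256.
# Alternatively, given a list "imgs" of equally sized C x H x W tensors:
# batch = torch.stack(imgs, dim=0)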

- Neural networks exhibit the best training performance when the input data
ranges roughly from 0 to 1, or from -1 to 1 (this is an effect of how their building
blocks are defined). Hence normalization is needed. For image data, one way is
to divide the input by 255; another way is to scale the input using its mean & standard
deviation, so that it has 0 mean & unit std.

- torch.mean(<data>, dim) & torch.var(<data>, <dim>) compute the mean &
variance of the provided input <data>, along the specified dimension i.e. dim=0
reduces over rows, giving a statistic per column, and dim=1 reduces over
columns, giving a statistic per row.
"dim" could be a single dimension, or a list of dimensions. If dim=None, reduce over
all dimensions.
torch.std(<data>, <dim>) computes the std.
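
A hedged sketch of the mean/std approach for an N x C x H x W batch (the data here
is a random placeholder): normalize each channel to 0 mean & unit std.
import torch

batch = torch.randint(0, 256, (8, 3, 64, 64)).float()
batch /= 255.0 # first bring values into the 0 to 1 range.
for c in range(batch.shape[1]): # loop over the C (channel) dimension.
    mean = torch.mean(batch[:, c])
    std = torch.std(batch[:, c])
    batch[:, c] = (batch[:, c] - mean) / std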

TIME SERIES:

(A) Time series: A dataset having C data fields (columns) can have its
rows ordered according to different points in time. This means the rows will be
ordered i.e. there will be a relationship of rows with respect to other rows, as
connected by points in time. The rows then indicate readings taken for those C
data fields (1st dimension) over a period of time, in intervals (say L - 2nd
dimension).

This C x L can be further batched into higher time intervals. For example, if L
represents hourly readings, there can be further batching based on daily readings,
using C x L, where L = 24 (24 hours in a day).

We could then have N number of days' readings along a new dimension.
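
A minimal sketch of this reshaping (all names & sizes are illustrative): view one long
run of hourly rows as N days x L=24 hours x C columns, then move channels before time.
import torch

C = 17 # hypothetical number of data columns.
hourly = torch.randn(730 * 24, C) # two years of hourly readings (fake data).
daily = hourly.view(-1, 24, C) # N x L x C, with N inferred as 730.
daily = daily.transpose(1, 2) # N x C x L, ready for e.g. nn.Conv1d.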

(B) Text: Text can have encodings/representations at character or word
level.
A common encoding at the character level is one-hot encoding, where a vector
of length equal to the number of possible characters is maintained with
all elements as 0, except the element representing the current
character, which is set to 1.
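
A minimal one-hot sketch for characters, assuming an ASCII-limited alphabet of 128
possible characters (the text below is a placeholder):
import torch

text = "abc"
one_hot = torch.zeros(len(text), 128)
for i, ch in enumerate(text):
    one_hot[i][ord(ch)] = 1 # set the single element representing this character.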

When there are many characters/words, one-hot encoding can be inconvenient,
hence another technique is to use Embedding.
In Embedding, instead of vectors of many zeros and a single one, we can use
vectors of floating-point numbers. A vector of, say, 100 floating-point numbers
can indeed represent a large number of words.
The trick is to find an effective way to map individual words into this 100-
dimensional space in a way that facilitates downstream learning. This is called an
embedding. In principle, we could simply iterate over our vocabulary and
generate a set of 100 random floating-point numbers for each word. This would
work, in that we could cram a very large vocabulary into just 100 numbers, but it
would forgo any concept of distance between words based on meaning or
context. A model using this word embedding would have to deal with very little
structure in its input vectors. An ideal solution would be to generate the
embedding in such a way that words used in similar contexts mapped to nearby
regions of the embedding.

Example: If we were to design a solution to this problem by hand, we
might decide to build our embedding space by choosing to map basic nouns and
adjectives along the axes. We can generate a 2D space where axes map to
nouns—fruit (0.0-0.33), flower (0.33-0.66), and dog (0.66-1.0)—and adjectives—
red (0.0-0.2), orange (0.2-0.4), yellow (0.4-0.6), white (0.6-0.8), and brown (0.8-
1.0). Our goal is to take actual fruit, flowers, and dogs and lay them out in the
embedding. So, We can map apple to a number in the fruit and red quadrant.
Likewise, we can easily map tangerine, lemon, lychee, and kiwi in the respective
ranges. For dogs and color, we can embed redbone near red; fox perhaps for
orange; golden retriever for yellow, poodle for white, & so on.

Embeddings are often generated using neural networks, trying to predict a word
from nearby words (the context) in a sentence. For example: we could start
from one-hot-encoded words and use a (usually rather shallow) neural network to
generate the embedding. Once the embedding was available, we could use it for
downstream tasks.
One interesting aspect of the resulting embeddings is that similar words end up
not only clustered together, but also having consistent spatial relationships with
other words. For example, if we were to take the embedding vector for apple and
begin to add and subtract the vectors for other words, we could begin to perform
analogies like apple - red - sweet + yellow + sour and end up with a vector very
similar to the one for lemon.
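
A hedged sketch of a learnable embedding table via nn.Embedding (vocabulary size &
embedding dimension below are illustrative): integer word indices map to dense float
vectors whose values are learned during training.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=100)
word_indices = torch.tensor([3, 42, 7]) # a "sentence" of three word ids.
vectors = embedding(word_indices) # shape: (3, 100).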

More contemporary embedding models—with BERT and GPT-2 making
headlines even in mainstream media—are much more elaborate and are context
sensitive: that is, the mapping of a word in the vocabulary to a vector is not fixed
but depends on the surrounding sentence. On the flip side, even when we deal
with text, improving the pre-learned embeddings while solving the problem at
hand has become a common practice.

When we are interested in co-occurrences of observations, the word embeddings
we saw earlier can serve as a blueprint, too. For example, recommender
systems (customers who liked our book also bought…) use the items the
customer already interacted with as the context for predicting what else will spark
interest.

- Building, Training & Using Neural Network (NN):

PyTorch has a whole submodule dedicated to neural networks, called torch.nn. It
contains the building blocks needed to create all sorts of neural network (NN)
architectures. Those building blocks are called modules (also called layers in
other frameworks).

- PyTorch nn.Module and its subclasses are designed to do prediction on
multiple samples at the same time. To accommodate multiple samples, modules
expect the zeroth dimension of the input to be the number of samples in the
batch.
NOTE: Modules can also contain other Modules, allowing them to be nested in a
tree structure. You can assign the submodules as regular attributes. So, if your
model class (say c1) is derived from nn.Module, & your model class in turn has
other modules (derived from nn.Module) as submodules, saving/loading your
model class c1 will automatically save/load it, and all its submodules recursively.

- Following are the steps for using a torch.nn learning algorithm/model:


(1) Get the input samples ready, in batches.

(2) Create a variable/instance that holds the model responsible for learning:
Ex: linear_model = nn.Linear(<no of inputs>, <no of outputs>)

(3) Create an optimizer, and pass model’s parameters to it.
Ex: opt = optim.SGD(linear_model.parameters(), lr=<learning rate>)

(4) Create a loss function instance.


Ex: loss_fn = nn.MSELoss()
NOTE: The gradients are "stored" by the tensors themselves (they
have a grad and a requires_grad attribute) once you call backward() on the loss.

(5) Run the training loop, providing epochs, model, optimizer, loss function,
training & validation data with labels:
def training_loop(epochsN, model, optimizer, lossFn, trainData, trainLabels,
                  valData, valLabels):
    for epoch in range(1, epochsN + 1):
        # forward + backward + optimize.
        predict = model(trainData) # forward pass.
        loss_train = lossFn(predict, trainLabels) # calculate loss.

        optimizer.zero_grad() # set grads to 0 in every iteration.

        # after the backward pass, all parameters have their "grad" populated.
        loss_train.backward() # backward pass - compute gradients of the loss.
        optimizer.step() # optimize. Updates the model using the values in grads.

        valPredict = model(valData) # perform validation.
        loss_val = lossFn(valPredict, valLabels)

        if epoch % 1000 == 0: # print every 1000 epochs.
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}") # print stats.

A Better training loop pseudocode would be:

(dataset is split into (random) mini batches - using DataLoader)


For N epochs:
For each minibatch (in dataset) from DataLoader:
(Train with this minibatch)

Forward pass
Compute loss
Accumulate gradient of loss (Backward pass)
Update model (optimizer step) with accumulated gradient

By shuffling samples at each epoch and estimating the gradient on one or
(preferably, for stability) a few samples at a time, we are effectively introducing
randomness in our gradient descent.
In SGD (Stochastic Gradient Descent), this is what the S is about: working on
small batches (aka minibatches) of shuffled data.
It turns out that following gradients estimated over minibatches, which are
poorer approximations of gradients estimated across the whole dataset, helps
convergence and prevents the optimization process from getting stuck in local
minima it encounters along the way.
Shuffling the dataset at each epoch helps ensure that the sequence of
gradients estimated over mini-batches is representative of the gradients
computed across the full dataset.
Typically, mini-batches are a constant size that we need to set prior to
training, just like the learning rate. These are called hyperparameters, to
distinguish them from the parameters of a model.
For shuffling & organizing data in minibatches, torch offers
DataLoader.
Ex:
dLoadVariable = torch.utils.data.DataLoader(<dataset>, batch_size=<value>,
                                            shuffle=<bool value>, …)
While iterating during training, use the DataLoader object as:
for <inputs>, <labels> in dLoadVariable: # for built-in datasets.
# for i, (<inputs>, <labels>) in enumerate(dLoadVariable)¹: enumerate also yields
# the batch index i.
    <do something>
<inputs> & <labels> are of size "batch_size" set in dLoadVariable.

¹ => When the above enumeration on the DataLoader is called for a given batch of
size "B", the DataLoader calls the Dataset's overridden __getitem__() B number of
times, using an index (that may be random if shuffle=True).
If the entire dataset is too large to fit in memory, we can load the required
data in __getitem__(), instead of doing it in the Dataset's __init__().

NOTE: If shuffle = True, the data is reshuffled into batches after every
epoch (i.e. after every complete (start to end) iteration over all the batches in
dataloader).

PyTorch code:
import torch
import torch.nn as nn
import torch.optim as optim # needed below for optim.SGD.

train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=True)


model = nn.Sequential(
    nn.Linear(3072, 512),
    nn.Tanh(),
    nn.Linear(512, 2),
    nn.LogSoftmax(dim=1)) # grads enabled by default.
learning_rate = 1e-2
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
loss_fn = nn.NLLLoss()
n_epochs = 100

#training.
for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        batch_size = imgs.shape[0]
        # imgs.view() reshapes the input to 2 dimensions: the 1st is batch_size,
        # the 2nd is inferred (-1).
        outputs = model(imgs.view(batch_size, -1))
        loss = loss_fn(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print("Epoch: %d, Loss: %f" % (epoch, float(loss)))

# validation
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64,
                                         shuffle=False)
correct = 0
total = 0
with torch.no_grad():
    for imgs, labels in val_loader:
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size, -1))
        _, predicted = torch.max(outputs, dim=1)
        total += labels.shape[0]
        correct += int((predicted == labels).sum())
print("Accuracy: %f" % (correct / total))

- Dataset:

We can use built-in datasets provided by PyTorch (from torchvision import
datasets), or subclass the torch.utils.data.Dataset class to create our own custom
data fetching functionality, overriding:
__init__(self, ...): performs initialization stuff, such as loading input &
output/labels data from corresponding files on disk (if enough memory),
transforming them, preparing them to be provided as input/output to the model for
training, etc.
__len__(self): returns the total number of data points in the entire
dataset.
__getitem__(self, index): returns a single (input, output) data point for the
specified index. Operations like loading input/output data from disk, transforming
data, etc. can optionally also be done here, instead of in __init__().

The Dataset retrieves our dataset’s features and labels one data point at a time,
for the DataLoader, which then reshuffles & creates mini batches for use in
training.
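
A minimal custom Dataset sketch (class & variable names are illustrative): it keeps
inputs & labels in memory and hands out one (input, label) pair per index.
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        return self.inputs[index], self.labels[index]

ds = MyDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
loader = DataLoader(ds, batch_size=16, shuffle=True)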

HANDY FUNCTIONS FOR LOADING DATA:

- For reading csv files, we can use pandas (import pandas as pd) and its
pd.read_csv(<file name>) function. This function loads data into a DataFrame.
Some handy operations on DataFrames:

(1) To get all rows from a dataframe "df" whose specific column
(say colX) has the value "col_value1":
df = df.loc[:][df["colX"] == "col_value1"]
[:] means select from all rows. The first [] after loc is for rows; the 2nd [] is
for columns. This command returns 1 or more rows.

(2) To get the value in column "colX" of the row whose column "rowX"
has the value "row_value":
df.loc[df["rowX"] == "row_value"]["colX"]
This returns 1 value (possibly in an array/Series).

(3) We can use the read_csv() arg chunksize(<int>) to load a huge csv file in
chunks, passing the number of lines to read from the file per chunk. This will
cause the function to return a TextFileReader object for iteration. Whatever
operations we need to do after loading the csv will need to be done in a loop, for
each loaded chunk, as sketched below.
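
A hedged sketch of chunked reading ("data.csv" is a placeholder file name): each
iteration yields a DataFrame with at most 10000 rows.
import pandas as pd

for chunk in pd.read_csv("data.csv", chunksize=10000):
    print(len(chunk)) # do the per-chunk processing here.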

- To get a list of files (along with files in any nested subdirectories, if
desired) in a folder in python, use torchdata.datapipes.iter.FileLister(...).
To install torchdata in anaconda first, use conda install -c pytorch torchdata.
Ex:
from torchdata.datapipes.iter import FileLister
dp = FileLister(root=".", recursive=True)
list(dp) # list all file names.

- To get a list of files/folders inside (the first level of) a directory, use
os.listdir(<path>). Beyond the first level of folders, os.listdir() does not return any
files or folders.
Ex:
import os

path = "C://Users//Desktop//sample_folder"
dir_list = os.listdir(path)
# files = [f for f in dir_list if os.path.isfile(path + '/' + f)] # obtaining only the files.
print(dir_list)

os.path.isfile() returns true only if the path name corresponds to a file, &
not a folder. Use not(os.path.isfile()) to check for folders. os.getcwd() returns the
current working directory.
<string>.endswith(<extension str>) can be used to check the extension of
the file name, when string contains the path of the file, with extension.

This function can be used for manual control of stepping inside subsequent
levels of folder structures - say, when manually stepping through training image
files that are grouped inside individual folders (contained in the supplied directory
path) that have their label as the folder names (i.e. dir path -> { label1 | label2 |
… folders } -> files in the corresponding folder).

To load images from disk, see section “Example of operations on images” later in
this doc.

Building NN:

- torch.nn provides the nn.Sequential container to concatenate modules (layers):


Example:
seq_model = nn.Sequential(
    nn.Linear(1, 13), # 1 input, 13 outputs
    nn.Tanh(), # activation function
    nn.Linear(13, 1)) # 13 inputs, 1 output

The end result is a model that takes the inputs expected by the first
module specified as an argument of nn.Sequential, passes intermediate outputs
to subsequent modules, and produces the output returned by the last module.
In the code above, the model fans out from 1 input feature to 13
hidden features, passes them through a tanh activation, and linearly combines
the resulting 13 numbers into 1 output feature.

When inspecting parameters of a model made up of several submodules, it is
handy to be able to identify parameters by name. There's a method for that,
called named_parameters:
for name, param in seq_model.named_parameters():
    print(name, param.shape)
Output:
0.weight torch.Size([13, 1])
0.bias torch.Size([13])
2.weight torch.Size([1, 13])

2.bias torch.Size([1])

The name of each module (0, 2 in the above output) in Sequential is just the ordinal
with which the module appears in the arguments. Interestingly, Sequential also
accepts an OrderedDict, in which we can name each module passed to
Sequential:
from collections import OrderedDict

seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, 8)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(8, 1))
]))

for name, param in seq_model.named_parameters():
    print(name, param.shape)
Output:
hidden_linear.weight torch.Size([8, 1])
hidden_linear.bias torch.Size([8])
output_linear.weight torch.Size([1, 8])
output_linear.bias torch.Size([1])

We can also access a particular Parameter by using submodules as attributes:


seq_model.output_linear.bias

Output:
Parameter containing: tensor([-0.0173], requires_grad=True)

- Model Summary: we can print the model summary, such as architecture,
number of parameters, etc., using:
# pip install torch-summary
from torchsummary import summary

(1) summary(<model variable>, <input to the model>)
Ex:
summary(model, torch.zeros(1, 3, 224, 224))
# if a "torchsummary not found" error comes up in conda, install torchsummary via pip in
# anaconda's virtual environment:

conda info --envs # get virtual environment path.

Output:
base /opt/anaconda3
FastAPI /opt/anaconda3/envs/FastAPI
PyTorch_1 * /opt/anaconda3/envs/PyTorch_1

/opt/anaconda3/envs/PyTorch_1/bin/pip install torch-summary # install torchsummary.

(2) print(model) # prints another form of summary (shows grouping under
features, avgpool, classifier sections) of the model.

- To get the number of trainable parameters in a model, we can also use the below
code:
num_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

- CNN (Convolutional Neural Networks):

- The term convolution refers to the mathematical combination of two
functions to produce a third function. It merges two sets of information. In the
case of a CNN, the convolution is performed on the input data with the use of a
filter or kernel (these terms are used interchangeably) to then produce a feature
map.
Ex: Let's do a 5x5 convolution on the image with no padding and a
stride of 1. If we only consider the width and height of the image, the convolution
process is kind of like this: 12x12 — (5x5) —> 8x8. The 5x5 kernel undergoes
scalar multiplication with every 5x5 patch of 25 pixels, giving out 1 number every time. We
end up with an 8x8 pixel image, since there is no padding (12–5+1 = 8).

- One important point of the convolution operation is that positional
connectivity exists between the input values and the output values (input values
in the NxN kernel space affect the output value). A convolution operation forms a
many-to-one relationship i.e. for an NxN kernel, it will map NxN values to 1 value.

For a given size of the input (i), kernel (k), padding (p), and stride (s), the size of
the output feature map (o) generated is given by:
o = floor( (i + 2p − k) / s ) + 1

- Convolutions deliver locality and translation equivariance.
Translational Equivariance, or just equivariance, is a very
important property of convolutional neural networks, where the position of the
object in the image should not be fixed in order for it to be detected by the CNN.
This simply means that if the input shifts, the output shifts correspondingly.

To be precise, a function f(x) is said to be equivariant to a function g
if f(g(x)) = g(f(x)). Ex: If we have a function g which shifts each pixel of the image,
one pixel to the right i.e I’(x,y) = I(x-1,y). If we apply the transformation 'g' on the
image 'I' and then apply convolution, the result will be the same as if we applied
convolution to 'I' and then applied translation 'g' to the output.
i.e. conv(g(I)) = g(conv(I)).
In short, Translation Invariance means that the system produces exactly the
same response, regardless of how its input is shifted. Equivariance means that
the system works equally well across positions, but its response shifts with the
position of the target.
When processing images, this simply means that if we move the input 1
pixel to the right, then its representations will also move 1 pixel to the right. The
property of translational equivariance is achieved in CNN’s by the concept of
weight sharing (of kernel). As the same weights are shared across the images,
hence if an object occurs in any image it will be detected irrespective of its
position in the image. This property is very useful for applications such as image
classification, object detection, etc where there may be multiple occurrences of
the object or the object might be in motion.
E.g.: if you are building a model to detect faces, all you need to detect is
whether eyes are present or not (translation equivariance); their exact position is
not necessary. In segmentation tasks, however, the exact position is required.
CNNs are not naturally equivariant to some other transformations such as
changes in the scale or rotation of the image. Other mechanisms are required to
handle such transformations.

Translational Invariance makes the CNN invariant to translation. Invariance
to translation means that if we translate the inputs, the CNN will still be able to
detect the class to which the input belongs. Translational Invariance is a result of
the pooling operation. Note that pooling gives some amount of translation
invariance for small shifts of input features - not complete translation invariance
over the entire input image space.
In a traditional CNN architecture, there are three stages. In the first stage,
the layer performs (1) convolution operation on the input to give linear
activations. In the second stage, the resultant activations are passed through a
(2) non-linear activation function such as sigmoid, tanh or ReLU. In the third
stage, we perform the (3) pooling operation to modify the output further. In

pooling operation, we replace the output of the convnet at a certain location with
a summary statistic of the nearby outputs such as a maximum in case of Max
Pooling. As we replace the output with the max in case of max-pooling, hence
even if we change the input slightly, it won’t affect the values of most of the
pooled outputs. Translational Invariance is a useful property where the exact
location of the object is not required.

- Convolution for a 2D image is defined as the scalar product of a weight
matrix (kernel / filter), with every neighbourhood pixel in the input.
Note that the weights in the kernel are not known in advance, but they are
initialized randomly and updated through backpropagation. Note also that the
same kernel, and thus each weight in the kernel, is reused across the whole
image. Thinking back to autograd, this means the use of each weight has a
history spanning the entire image. Thus, the derivative of the loss with respect to
a convolution weight includes contributions from the entire image.
With deep learning, we let kernels be estimated from data.

CNNs provide the three basic advantages over the traditional fully connected
layers:
(1) Firstly, they have sparse connections (processing local data
instead of entire image via kernel size) instead of fully connected connections
which lead to reduced parameters and make CNN’s efficient for processing high
dimensional data.
(2) Secondly, weight sharing takes place where the same (kernel)
weights are shared across the entire image, causing reduced memory
requirements as well as translational equivariance.
(3) Thirdly, CNN’s use a very important concept of subsampling or
pooling in which the most prominent pixels are propagated to the next layer
dropping the rest. This provides a fixed size output matrix which is typically
required for classification and invariance to translation, rotation.

Convolution NN contains following blocks:


- Convolution operation
- Pooling Operation
- Flattening layer (to reduce the dimension & flatten outputs, so that it can
then be linked to the subsequent output layer - nn.Flatten())

- The torch.nn module provides convolutions for 1, 2, and 3 dimensions:
nn.Conv1d for time series, nn.Conv2d for images, and nn.Conv3d for volumes or
videos.
nn.Conv2d(<input channels>, <output channels>, <size of kernel>) - More
channels in the output image means more capacity for the network. We need
the channels to be able to detect many different types of features. Channels per
convolution play the same role as the number of neurons per layer in a traditional NN.
Essentially, when an image is convolved by multiple filters, the output has
as many channels as there are filters that the image is convolved with.
In general, the more filters there are in a CNN, the more features of an
image that the model can learn about.

Ex:
nn.Conv2d(3, 16, kernel_size=3) # input is 3 channel (RGB), output is 16.
kernel_size=(<height>, <width>) - if single value, it means (<value>, <value>).
For nn.Conv3d(), kernel_size is (v1, v2, v3) i.e. 3 values, indicating a 3D kernel.

- The way of using convolutional networks is: stacking one convolution
after the other and at the same time downsampling (reducing image size) the
image between successive convolutions.
The layers closer to the input layer learn lower level features (ex:
edges), while layers farther from the input learn higher level features (ex: ears,
face, etc).

Formula for the output tensor dimensions of a Conv2d() layer:

Hout = ( (Hin + 2*padding[0] − dilation[0]×(kernel_size[0]−1) − 1) / stride[0] ) + 1
Wout = ( (Win + 2*padding[1] − dilation[1]×(kernel_size[1]−1) − 1) / stride[1] ) + 1

By default, stride & dilation = 1 and padding = 0, so, with these values:
Hout = Hin − (kernel_size[0]−1)
Wout = Win − (kernel_size[1]−1)

- Downsampling: Scaling an image by half is the equivalent of taking four
neighbouring pixels as input and producing one pixel as output. Types of scaling
are:

Average the four pixels: This average pooling was a common
approach early on but has fallen out of favour somewhat.
Adaptive Avg/Max Pooling: In AdaptiveAvgPool2D(), we specify the
output feature map size instead (can work as a kind of global pooling too). The
layer automatically computes the kernel size so that the specified feature map
size is returned. The major advantage with this layer is that whatever the input
size, the output from this layer is always fixed and, hence, the neural network can
accept images of any height and width. Also see AdaptiveMaxPool2D().

Max Pooling: Take the maximum of the four pixels. This
approach, called max pooling, is currently the most commonly used approach,
but it has a downside of discarding the other three-quarters of the data.
By keeping the highest value in the n × n (say 2 x 2 for n=2) (n = kernel
size of max pool) neighbourhood as the downsampled output, we ensure that the
features that are found survive the downsampling, at the expense of the weaker
responses. Max pooling is provided by the nn.MaxPool2d(<n> OR *size) module.
How much Maxpool<n>d(<kernel_size>) reduces the input tensor
dimensions is given by the below formula:
Output_Height = ((Input_Height + (2*Padding) - Kernel_Height) / Stride) + 1
Output_Width = ((Input_Width + (2*Padding) - Kernel_Width) / Stride) + 1

NOTE: If a stride value is not provided, it is assumed to be the
same as the kernel size.
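
A quick sketch checking the pooling formula: with kernel 2 and the default stride equal
to the kernel size, a 32x32 input halves to 16x16, as ((32 + 0 - 2) / 2) + 1 = 16.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2)
print(pool(torch.randn(1, 3, 32, 32)).shape) # torch.Size([1, 3, 16, 16])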

Strided Convolution: Perform a strided convolution, where only
every Nth pixel is calculated. A 3 × 4 convolution with stride 2 still incorporates
input from all pixels from the previous layer. The literature shows promise for this
approach, but it has not yet supplanted max pooling.

Depthwise Separable convolution: While standard convolution performs
the channelwise and spatial-wise (i.e. width-height) computation in one step,
Depthwise Separable Convolution splits the computation into two steps:
depthwise convolution applies a single convolutional filter per input
channel, and pointwise convolution is used to create a linear combination of the
output of the depthwise convolution.

Depthwise convolutions (used in the MobileNet architecture) are faster & more compact
than regular convolution operations, as they require fewer computations to get
the same result for a corresponding convolution operation of a given size.

Some disadvantages of Depthwise separable convolutions are:


- Limited Representation Power: Separation can limit the ability of the
network to capture complex spatial patterns compared to traditional convolutions
that can learn more intricate features across different channels.
- Increased Depth: To achieve the same representational power as
traditional convolutions, depthwise separable convolutions often require a larger
number (i.e. stacking of) of depthwise and pointwise filters.

Dilated (atrous) Convolutions:

Dilated convolutions (or atrous - "trous" means holes in French) introduce
another parameter to convolutional layers called the dilation rate. This defines a
spacing between the values in a kernel. A 3x3 kernel with a dilation rate of 2 will
have the same field of view as a 5x5 kernel, while only using 9 parameters.
Imagine taking a 5x5 kernel and deleting every second column and row.

This delivers a wider field of view at the same computational cost. Dilated
convolutions are particularly popular in the field of real-time segmentation. Use
them if you need a wide field of view and cannot afford multiple convolutions or
larger kernels.
The technique of atrous convolutions enables the model to capture larger
patterns (with fewer parameters) and performs well (say, in detecting the number
of people in a crowd).
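
A minimal sketch of a dilated convolution (channel counts are illustrative): a 3x3
kernel with dilation=2 covers the same 5x5 field of view as a dense 5x5 kernel, with
only 9 weights per channel; padding=2 keeps the spatial size unchanged here.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding=2)
out = conv(torch.randn(1, 3, 32, 32))
print(out.shape) # torch.Size([1, 16, 32, 32])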

Transposed Convolution:

A convolution operation forms a many-to-one relationship. However, transposed
convolution demonstrates a one-to-many relationship, used for upscaling the
image.
The weights in the transposed convolution are learnable. So we do not
need a predefined interpolation method like nearest-neighbour interpolation,
bilinear interpolation, or bicubic interpolation.

A transposed convolution is somewhat similar (but not the same) to deconvolution,
because it produces the same spatial resolution that a hypothetical deconvolutional
layer would. However, the actual mathematical operation that's being performed
on the values is different.
Imagine inputting an image into a single convolutional layer. Now take the output,
throw it into a black box, and out comes your original image again. This black box
does a deconvolution. It is the mathematical inverse of what a convolutional layer
does.
A transposed convolutional layer carries out a regular convolution but reverts its
spatial transformation. To achieve this, we need to perform some fancy padding
on the input.

Basically, transposed convolution does some padding on the original image,
followed by a convolution operation.

(Figure: 2D convolution with no padding, stride of 2, and kernel size of 3 - the kernel
shifts by 2 elements from its current position.)

(Figure: Transposed convolution operation - the padding and stride values are the ones
that hypothetically were carried out on the output to generate the input; i.e. if you take
the output and carry out a standard convolution with the stride and padding defined, it
will generate the same spatial dimensions as those of the input.)

Implementing a transposed convolutional layer can be better explained as a 4-
step process:
Step 1: Calculate the new parameters z and p'.
Step 2: Between each row and column of the input, insert z number of
zeros. This increases the size of the input to (2*i-1)x(2*i-1). (i = input image size)
Step 3: Pad the modified input image with p' number of zeros.

Step 4: Carry out a standard convolution on the image generated from step 3,
with a stride length (s') of 1.

For a given size of the input (i), kernel (k), padding (p), and stride (s), the size of
the output feature map (o) generated by a transposed convolution is given by:
o = (i − 1)*s + k − 2p
- Subclassing modules: If we wish to have additional functionality in
modules (such as nn.Linear, nn.Conv2d, nn.MaxPool2d, etc.) that is not provided
in the predefined modules, we can subclass our custom module from the nn.Module
class, & define a forward function (for performing the forward pass). This subclass
will be a replacement for the nn.Sequential model. All the inner layers will be
defined (in the constructor) & used here, in the forward pass.
Ex: Input image has size 32 x 32 x 3 (RGB channels).
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__() # always call this.
        # registering submodules in the constructor so that a parameters() call can find
        # all such submodules defined at top level in this class. The objects below live
        # as long as the module exists.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.act1 = nn.Tanh()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.act2 = nn.Tanh()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.act3 = nn.Tanh()
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = self.pool1(self.act1(self.conv1(x)))
        out = self.pool2(self.act2(self.conv2(out)))
        out = out.view(-1, 8 * 8 * 8) # custom operation that could not be carried out
                                      # with nn.Sequential.
        out = self.act3(self.fc1(out))
        out = self.fc2(out)
        return out

- Functional API: Functional means "having no internal state" - in other
words, "whose output value is solely and fully determined by the value of the input
arguments". PyTorch provides functional counterparts to every nn module. They can
be accessed in torch.nn.functional.
nn.Modules are defined as Python classes and have attributes, e.g. a
nn.Conv2d module will have some internal attributes like self.weight.
F.conv2d however just defines the operation and needs all arguments to be
passed (including the weights and bias). Internally, modules will usually call their
functional counterpart in the forward method somewhere.
Functional counterparts can be used for cases where, for example, we
wish to custom manipulate weights & biases within a convolution: if we use
nn.Conv2d(), it comes with its own weights/bias/parameters, whereas
nn.functional.conv2d() will use our weights/parameters.
In the above example, we could use modules like nn.Tanh() &
nn.MaxPool2d() inside forward itself, without declaring them in the constructor. For
this purpose, we need to use the functional counterparts of those modules, such
as torch.tanh, nn.functional.max_pool2d, & so on.
Ex:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        # using functional APIs (the functional tanh has moved from torch.nn.functional
        # to the torch namespace). Output has size 32x32, 16 channels.
        out = torch.tanh(self.conv1(x))
        out = F.max_pool2d(out, 2) # output has size 16x16, 16 channels.
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2) # output has size 8x8 (due to
                                                           # maxpool), 8 (due to conv) channels.
        out = out.view(-1, 8 * 8 * 8) # flatten.
        out = torch.tanh(self.fc1(out)) # output now has size: 32.
        out = self.fc2(out) # no activation for the final linear module (output size: 2).
        return out

This class can now be instantiated as:
modelVar = Net()

Another use case can be performing in-place operations, to be memory
efficient.

- Module Definition: A module is a container for state in the form
of parameters and submodules, combined with the instructions to do a
forward pass.

- The individual layers of the Module-derived class can be accessed/used
for processing individually, such as:
modelVar.conv1(<input>)
Or, in case the layers are defined in nn.Sequential, they can be accessed such
as:
<model variable>.seq_model[0] & so on.
This can be useful in transfer learning.

Their weights can also be accessed such as modelVar.conv1.weight.

- Saving/Loading a model:
We can save a model using:
torch.save(<model variable>.state_dict(), <path with filename.pt>)
The saved file now contains all the parameters of the model,
but no structure (i.e. architecture), only weights.
So, for loading the saved model, we need to have our model class handy,
instantiate it, then load its parameters from this file. Ex:
modelVar = Net()
modelVar.load_state_dict(torch.load(<path to file.pt>))

load_state_dict() copies parameters and buffers from state_dict into this module
and its descendants (in case module derived class contains other nn.Module
derived modules).
A common PyTorch convention is to save models using either a .pt or .pth file
extension.
NOTE: Using the above method, you will need the model definition(class) to
load the state_dict (after creating the model object using the model class
blueprint).
A good practice is to transfer the model to the CPU before calling torch.save, as
this will save tensors as CPU tensors and not as CUDA tensors. This will help in
loading the model on any machine, whether it has CUDA capabilities or not.
Ex:
torch.save(<model variable>.to('cpu').state_dict(), …)
NOTE: When loading files from a weights file in a virtual environment,
the path to the file should be relative to the script that is running this command;
or absolute path.

NOTE: tensors can also be saved/loaded using the save()/load() APIs:
Example Code:
torch.save(pred, "pred.pt") # "pred" is a tensor.
pred = torch.load("pred.pt")

Saving/Loading an entire model: We can also directly save the model itself,
along with its parameters, using torch.save(<model>) (instead of
torch.save(<model>.state_dict())). In that case, we need to load it in the same
way i.e. using <model> = torch.load(<path>), instead of using
<model>.load_state_dict(torch.load(<path>)).
NOTE: When using above method, following are the cons:
Since Python's pickle module is used internally, the serialized data (saved model)
is bound to the specific classes and the exact directory structure. Pickle simply
saves a path to the file containing the specific (model) class.
As you can imagine, the code might break after refactoring as the saved model
might not link to the same path. Using such a model in another project is hard as
well since the path structure needs to be maintained.

Using TorchScript to export a complete model:

One common way to do inference with a trained model is to use
TorchScript, an intermediate representation of a PyTorch model that can be run
in Python as well as in a high-performance environment like C++. TorchScript is
actually the recommended model format for scaled inference and deployment.
Using the TorchScript format, you will be able to load the exported model
and run inference without defining the model class.
Ex:
# EXPORT & SAVE MODEL:
model_scripted = torch.jit.script(model) # Export to TorchScript
model_scripted.save('model_scripted.pt') # Save
# LOAD MODEL:
model = torch.jit.load('model_scripted.pt')
model.eval()

Remember that you must call model.eval() to set dropout and batch
normalization layers to evaluation mode before running inference. Failing to do
this will yield inconsistent inference results.

USING TORCHSCRIPT MODEL IN C++:


(Link: https://pytorch.org/tutorials/advanced/cpp_export.html)

To load your serialized PyTorch model in C++, your application must depend on
the PyTorch C++ API – also known as LibTorch.
Ex:
#include <torch/script.h>

torch::jit::script::Module module;
// Deserialize the ScriptModule from a file using torch::jit::load().
module = torch::jit::load(<file name>);

// Create a vector of inputs. torch::ones() provides a single input tensor of the
// specified dimensions, i.e. 3-channel 224x224 inputs, of batch size = 1.
std::vector<torch::jit::IValue> inputs;
inputs.push_back(torch::ones({1, 3, 224, 224}));

// Execute the model (inference) and turn its output (IValue) into a tensor via toTensor().
at::Tensor output = module.forward(inputs).toTensor();
std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << '\n';

Saving a model’s checkpoint:

Usually, your ML pipeline will save model checkpoints periodically, or when a
condition is met, so that training can resume from the last or best
checkpoint. It is also a safeguard in case training gets disrupted due to some
unforeseen issue.
However, saving the model's state_dict is not enough in the context of a
checkpoint. You will also have to save the optimizer's state_dict, along with the
last epoch number, loss, etc. Basically, you might want to save everything that
you would require to resume training using a checkpoint.
Ex:
SAVE:
torch.save({'epoch': EPOCH,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': LOSS},
           'save/to/path/model.pth')

LOAD:

model = MyModelDefinition(args)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

checkpoint = torch.load('load/from/path/model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

- Moving parameters to GPU for faster processing:

We can move the tensors we get from the data loader to the GPU, after
which our computation will automatically take place there. But we also need to
move our parameters to the GPU. nn.Module also provides a .to() function, just
like Tensor.to(). Module.to() moves all of its parameters to the GPU.
Module.to() is in place: the module instance is modified. But Tensor.to() is
out of place - it returns a new tensor. One implication is that it is good practice to
create the Optimizer after moving the parameters to the appropriate device.

It is considered good style to move things to the GPU if one is available. A good
pattern is to set a device variable depending on torch.cuda.is_available:
device = (torch.device('cuda') if torch.cuda.is_available()
          else torch.device('cpu'))
print(f"Training on device {device}.")

Then we can amend the training loop by moving the tensors we get from the data
loader to the GPU by using the Tensor.to method.
Example Training Loop:
for imgs, labels in train_loader:
    imgs = imgs.to(device=device) # move "imgs" to the GPU.
    labels = labels.to(device=device) # move "labels" to the GPU.
    outputs = model(imgs)
    loss = loss_fn(outputs, labels)

We can similarly move a loaded model to a specified device:
loaded_model = Net().to(device=device)

There is a slight complication when loading network weights: PyTorch will
attempt to load the weight to the same device it was saved from—that is, weights
on the GPU will be restored to the GPU. As we don’t know whether we want the
same device, we have two options: we could move the network to the CPU
before saving it, or move it back after restoring.
It is a bit more concise to instruct PyTorch to override the device information
when loading weights. This is done by passing the map_location keyword
argument to torch.load:
loaded_model.load_state_dict(torch.load(data_path + 'birds_vs_airplanes.pt',
map_location=device))

- Regularization:

Training a model involves two critical steps: optimization, when we need the
loss to decrease on the training set; and generalization, when the model has to
work not only on the training set but also on data it has not seen before, like the
test set. Regularization techniques are the tools that help generalization; a common
family of them adds penalty terms to the loss function that encourage the model's
weights to be small.

(1) Weight Penalties: The first way to stabilize generalization is to add a
regularization term to the loss. This term is crafted so that the weights of the
model tend to be small on their own, limiting how much training makes them
grow.
This makes the loss have a smoother topography, and there's relatively less to
gain from fitting individual samples (reducing overfitting). The most popular
regularization terms of this kind are L2 regularization (weight decay / Ridge
regularization), which is the sum of squares of all weights (w1² + w2² + …) in the
model, and L1 regularization (Lasso regularization), which is the sum of the
absolute (abs()) values of all weights (|w1| + |w2| + …) in the model (a difference
between MAE & L1 regularization is that MAE is a mean i.e. the sum is divided by
"n", whereas L1 is just the sum).
L1 regularization tends to produce sparse models by driving some weights
to exactly zero, while L2 regularization tends to produce smoother weight
distributions (the difference between MSE & L2 is that MSE is an average i.e. divided
by "n", whereas L2 is not).

Both of them are scaled by a (small) factor, which is a hyperparameter we set
prior to training.
Note that weight decay applies to all parameters of the network, such as biases.
Ex:
Implementing L1 regularization in a model:
# code is inside the training loop.
def training_loop(...):
    ...
    # torch.norm(param, 1) gives the absolute sum of the weight and bias values of a layer.
    L1_regularization = 0
    for param in model.parameters():
        L1_regularization += torch.norm(param, 1) # L1 regularization.
    # add this penalty (scaled by 0.0001) to the loss.
    batch_loss = loss_fn(prediction, y) + 0.0001 * L1_regularization
    batch_loss.backward()

For L2 regularization, use torch.norm(param, 2) in the above code (might need
to make the scale larger, around say 0.01, as values might be in the -1 to 1 range, so
squaring the weights might make the values much smaller already).
L2 can alternatively be specified via the optimizer's "weight_decay" argument
(the range is usually 1e-4 to 1e-5), as sketched below.
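
A minimal sketch of L2 regularization via the optimizer itself (assumes a "model"
nn.Module defined elsewhere): weight_decay applies the L2 penalty inside the update.
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)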

(2) Dropout: The idea behind dropout is: zero out a random fraction
of outputs from neurons across the network, where the randomization happens at
each training iteration.
By dropping some connections in the ANN we are forcing the network to
learn from fewer resources. This forces the model to generalize.
This procedure effectively generates slightly different models with different
neuron topologies at each iteration, giving neurons in the model less chance to
coordinate in the memorization process that happens during overfitting.
We can use the nn.Dropout(<probability>) module to introduce dropouts in our
model, between the nonlinear activation function (in the current layer) and the
linear or convolutional module of the subsequent layer.

(Figure: Placement of the Dropout layer.)
Ex:
# in __init__() of the subclass.
self.conv2_dropout = nn.Dropout2d(p=0.4)

# in forward().
out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
out = self.conv2_dropout(out) # use dropout.

Execution of dropout can be avoided during prediction, by setting its "train"
property. When we call the model's .train() / .eval() functions, this call is
automatically replicated on all its submodules, so if dropout is among them, it will
behave accordingly in the forward & backward passes.
To check if a model is in training mode or not, use the <model>.training flag.

During inference too, we use the dropout layer that was used during training.
This means that all the units are considered during the prediction step. But,
because all the units/neurons from a layer are taken, the final weights will be
larger than expected; to deal with this problem, weights are first scaled by the
chosen dropout rate (this is to make sure that the distribution of the values
after the affine transformation during inference time is close to that during training
time). With this, the network is able to make accurate predictions.

(Figure: Original implementation of Dropout.)

To be more precise, if a unit is retained with probability p during training,
the outgoing weights of that unit are multiplied by p during the prediction stage.

(3) Batch Normalization:

The main idea behind batch normalization is to rescale the inputs (even the
inputs of the hidden layers) to the activations of the network so that minibatches
have a certain desirable distribution (It’s possible that the input distribution at a
particular layer keeps changing across batches). This helps avoid the “inputs
to activation functions” being too far into the saturated portion of the
function, thereby killing gradients and slowing training.
Batch normalization shifts and scales an (intermediate) input using the mean and
standard deviation collected at that (intermediate) location over the samples of
the minibatch. Pytorch provides nn.BatchNorm1D, nn.BatchNorm2d, and
nn.BatchNorm3d modules, depending on the dimensionality of the input.

(Figure: Placement of the Batch Normalization layer.)


Since the aim for batch normalization is to rescale the inputs of the
activations, the natural location is after the linear/convolution transformation and
before the activation, although it can be placed after the activation function too:
Ex:
# in __init__() of the subclass.
# n_chans1 is the number of output channels from the immediately preceding layer.
self.conv1_batchnorm = nn.BatchNorm2d(num_features=n_chans1)
# in forward(). Batch norm used after conv, but before the tanh activation:
out = self.conv1_batchnorm(self.conv1(x))
out = F.max_pool2d(torch.tanh(out), 2)

Sometimes, even the outputs of the hidden layers can become large or
extremely small, especially in deep NNs, due to multiple matrix multiplications
resulting in huge or extremely small weights. Batch Normalization can be used,
via the above BatchNorm layers, to normalize such values & bring them into an
acceptable range.
Batch Normalisation helps to avoid the issue of gradients becoming so small that
the weights are barely updated, especially in deep NNs.

Training vs Evaluation behaviour:


During training, the BN layer keeps a running estimate of its
computed mean and variance. This running estimate is updated with a default
momentum of 0.1. During evaluation, this running mean/variance (from training) is
used for normalization.

Limitations:
Two limitations of batch normalization can arise:
(1) In batch normalization, we use the batch statistics: the mean and
standard deviation corresponding to the current mini-batch. However, when the
batch size is small, the sample mean and sample standard deviation are not
representative enough of the actual distribution and the network cannot learn
anything meaningful.
(2) As batch normalization depends on batch statistics for normalization,
it is less suited for sequence models (we use Layer normalization in sequences).
This is because, in sequence models, we may have sequences of potentially
different lengths and smaller batch sizes corresponding to longer sequences.

- Instance Normalization (IN) & Group Normalization (GN):

Layer Normalization (LN) normalizes the activations along the feature dimension.
Since it doesn’t depend on batch dimension, it’s able to do inference on only one
data sample.

IN is very similar to Layer Normalization, but the difference between them is that
IN normalizes across each channel in each training example (i.e. per channel per
example), whereas LN normalizes across all features in each training example
(i.e. all features per example).
Unlike BN, IN layers use instance statistics computed from the input data in
both training and evaluation mode.

Similar to LN, GN is also used along the feature dimension, but it divides the
features into groups and normalizes each group respectively.
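
A hedged sketch contrasting these normalization layers on an N x C x H x W input
(sizes are illustrative); all of them are drop-in nn.Module layers:
import torch
import torch.nn as nn

x = torch.randn(4, 32, 8, 8)
bn = nn.BatchNorm2d(32) # per channel, statistics across the whole batch.
inorm = nn.InstanceNorm2d(32) # per channel, per sample.
gn = nn.GroupNorm(num_groups=8, num_channels=32) # per group of channels, per sample.
ln = nn.GroupNorm(num_groups=1, num_channels=32) # one group ~ layer norm over C, H, W.
print(bn(x).shape, inorm(x).shape, gn(x).shape, ln(x).shape)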

- Skip Connections & ReLU:

A skip connection (also called an identity mapping) is nothing but the addition
of the output of a (previous) layer to the input of another (further) layer that is not
adjacent to it. It results in the output of the previous layer being fed to the input of
a future layer.
Ex:
class NetRes(nn.Module):
    def __init__(self, n_chans1=32):
        super().__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(n_chans1, n_chans1 // 2, kernel_size=3,
                               padding=1)
        self.conv3 = nn.Conv2d(n_chans1 // 2, n_chans1 // 2,
                               kernel_size=3, padding=1)
        self.fc1 = nn.Linear(4 * 4 * n_chans1 // 2, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)
        out = F.max_pool2d(torch.relu(self.conv2(out)), 2)
        out1 = out
        # input added from an earlier layer (the skip connection).
        out = F.max_pool2d(torch.relu(self.conv3(out)) + out1, 2)
        out = out.view(-1, 4 * 4 * self.n_chans1 // 2)
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out

So, the way to implement skip connections is to just arithmetically add earlier
intermediate outputs to downstream intermediate outputs.
nn.ReLU(inplace=True) inplace=True means that it will modify the input
directly, without allocating any separate memory for additional output. It can
sometimes slightly decrease the memory usage, but may not always be a valid
operation (because the original input is destroyed).
Similarly, torch.sigmoid_() is an inplace operation, whereas torch.sigmoid() is not.

Modern Computer Vision with PyTorch (ebook):

- STEPS TO TRAIN A NEURAL NETWORK:

(1) Import the relevant packages (ex: torch, torchvision, numpy, etc).
(2) Build a dataset that can fetch data one data point at a time.
(3) Wrap a DataLoader around the dataset.
(4) Build a model and then define the loss function and the optimizer.
(5) Define two functions to train and validate a batch of data, respectively.
(6) Define a function that will calculate the accuracy of the (validation) data.
(7) Perform weight updates based on each batch of data over increasing epochs,
till desired accuracy (on validation dataset) & acceptable loss values (on training
dataset) are obtained. Also, plot the accuracies & losses in a graph, over the
epochs iteration.

- Example of operations on images:

# pip install opencv-python


# conda install -c conda-forge matplotlib

import cv2 # for loading an image file from disk.
import matplotlib.pyplot as plt # for showing the image.
import torchvision.transforms as transforms

# load image from disk. Channels are ordered as BGR, instead of RGB. Format (H x W x C).
img = cv2.imread(<file path>)

# crop image by selecting between start & end rows, start & end columns.
img = img[<rowStart> : <rowEnd> , <columnStart> : <columnEnd>]

# convert color type of image (RGB to grayscale, reorder channels, etc).
img2 = cv2.cvtColor(<input image>, cv2.COLOR_<op code>)

plt.imshow(<image>) # plot/show image.

img = cv2.resize(<input image>, <size>) # resize an image.

# convert cv2 image to pytorch tensor. Can use below technique for “PIL Image” too.
t1 = transforms.ToTensor()
imgTensor = t1(img) # Format (C x H x W).

# convert to format (H x W x C) & show in the matplotlib window.


plt.imshow(imgTensor.permute(1,2,0)) # permute does not change the original
tensor shape, just returns a new view of the same data.
plt.show()

- Plotting of values can be done using matplotlib:


Ex:
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 5))  # figsize is (width, height) in inches: 20 along the X axis, 5 along the Y axis.
plt.title('title')  # plot title.
plt.plot(valuesTensorX, valuesTensorY, label='some label')  # plot data.

- Data Augmentations:

The imgaug library can be used for performing many augmentations/modifications to images, such as affine transforms (translation, rotation, scaling), adding noise, etc., using its augmenters class (conda install imgaug):
Ex:
from imgaug import augmenters as iaa
aug = iaa.Affine(scale=2)  # perform scaling.
aug = iaa.Affine(translate_px=10)  # translation by 10 pixels (-10 means translate in the opposite direction).
aug = iaa.Affine(translate_px={'x': 10, 'y': 2})  # translation by 10 pixels across columns, 2 across rows. (-10, 10) means translate randomly within that range.
aug = iaa.Affine(rotate=30, fit_output=True)  # rotation by 30 degrees.

NOTE: Sometimes, during transformation of images, certain pixels are cropped out of the image post-transformation. fit_output is a parameter that can help with the preceding scenario. By default, it is set to False.
Note that when the size of the augmented image increases (for example, when the image is rotated), we need to decide how the new pixels that were not part of the original image should be filled in. cval specifies the pixel value used to fill such newly created pixels (for example, when fit_output=True enlarges the canvas).

Augmenting a batch of N images is much faster than augmenting N images one at a time. Use the augment_images() function for batch augmentation (it returns a numpy array).
Also, pass collate_fn=<function name> to the DataLoader to use batch processing with a custom-defined function, as a Dataset's __getitem__() returns only one item at a time. A common body for such a collate_fn is: return tuple(zip(*batch)).
'*' is the 'splat' operator. It is used for unpacking a list into arguments. For example: foo(*[1, 2, 3]) is the same as foo(1, 2, 3).
The zip() function takes n iterables and returns an iterator of tuples, where the ith tuple contains the ith element of each of the iterables; the number of tuples equals the length of the shortest iterable. Ex: zip(['a', 'b', 'c'], [1, 2, 3]) -> ('a', 1) ('b', 2) ('c', 3).
This helps to reorganize the batch data so that all inputs have the same structure, as required by DataLoaders.
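A minimal sketch of such a collate_fn with batch augmentation (the augmenter & tensor conversion are assumptions; the images are assumed to already share one size so they can be stacked):

import numpy as np
import torch
from imgaug import augmenters as iaa

aug = iaa.Affine(rotate=(-10, 10))  # hypothetical augmenter: random rotation within +/- 10 degrees.

# custom collate_fn: receives the list of (image, label) pairs produced by __getitem__().
def collate_fn(batch):
    images, labels = tuple(zip(*batch))          # unzip into a tuple of images & a tuple of labels.
    images = aug.augment_images(list(images))    # augment the whole batch at once (numpy arrays, H x W x C).
    images = torch.tensor(np.stack(images)).permute(0, 3, 1, 2).float()  # -> (N, C, H, W) tensor.
    return images, torch.tensor(labels)

# usage: DataLoader(dataset, batch_size=32, collate_fn=collate_fn)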

- Training with minimal data:

1) Zero shot learning:

Zero-shot learning is a setup in which a model can learn to recognize things that it hasn't explicitly seen before in training.
Intuitively, we resort to the attributes of the object in the image (even though the attributes are not given explicitly for training) and then try to identify the object that matches the most attributes.
We can leverage Word Vectors for automatic learning & matching of attributes.
Word vectors encompass semantic similarity among words. For example, all
animals would have similar word vectors and automobiles would have very
different word vector representations.
Following is a general strategy for implementing zero-shot learning:
1) Import the dataset – which constitutes images and their corresponding
classes.
2) Fetch the word vectors corresponding to each class from pre-trained word
vector models.
3) Pass an image through a pre-trained image model such as VGG16. We
expect the network to predict the word vector corresponding to the object in the
image.
4) Once we've trained the model, we predict the word vector on new images.
5) The class of the word vector that is closest to the predicted word vector is the
class of the image.
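A minimal sketch of steps 4-5, the nearest word-vector lookup (the class names, vector size, and random vectors here are placeholders):

import torch
import torch.nn.functional as F

class_names = ["cat", "dog", "horse"]
class_vectors = torch.randn(3, 300)   # in practice, fetched from a pre-trained word-vector model.
predicted_vector = torch.randn(300)   # the image model's predicted word vector for a new image.

# cosine similarity between the prediction & every class vector; the closest class wins.
sims = F.cosine_similarity(predicted_vector.unsqueeze(0), class_vectors, dim=1)
print("predicted class:", class_names[sims.argmax().item()])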

2) Few Shot learning:

Few-shot learning is a setup in which a model can classify an input based on very few examples. Some networks that help in this are Siamese networks, Prototypical networks & Relation Matching networks. All three algorithms aim towards learning to compare two images to come up with a score for how similar the images are.
Here, we input a few (input) images of each class to the network while
training and ask it to predict the class for a new (query) image based on the
images. So far, we have been using pre-trained models to solve such problems.
However, such models are likely to overfit soon, given the tiny amount of data
that is available.
We can leverage multiple metrics, models, and optimization-based
architectures to solve such scenarios. We use metric-based architectures that
come up with an optimal metric, either a Euclidean distance or cosine similarity,
to group similar images together and then predict on a new image.
An N-shot, k-class classification is one where there are N images for each of the k classes with which to train the network.
Uses of such similarity measures can be found in applications such as verifying handwritten checks, face recognition, etc.

- Siamese Networks:

The word Siamese in the preceding architecture relates to passing two images through a twin network (where we duplicate the network to handle two images) to fetch image encodings of each of the two images.
Further, we compare the encodings of the two images to fetch a similarity score. If the similarity score (or dissimilarity score) is beyond a threshold, we consider the images to be of the same person.
We expect the CNN to sum the loss values corresponding both to the classification loss if the images are of the same person, and the distance between the two images. We can use triplet loss or contrastive loss (which takes as input a pair of samples that are either similar or dissimilar, and brings similar samples closer while pushing dissimilar samples far apart) to train the network. More formally, suppose that we have a pair (Ii, Ij) and a label Y that is equal to 0 if the samples are similar and 1 otherwise. To extract a low-dimensional representation of each sample, we use a CNN f that encodes the input images Ii and Ij into an embedding space, where xi = f(Ii) and xj = f(Ij). The contrastive loss is defined as:
L = (1 - Y) * ||xi - xj||^2 + Y * max(0, m - ||xi - xj||)^2
where m is a margin hyperparameter, defining the lower bound on the distance between dissimilar samples.
Contrastive loss is used to penalize the network (during training) for marking 2 images as dissimilar (high distance value) when they are similar, as well as for marking 2 images as similar (low distance value) when they are dissimilar.
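A minimal sketch of this loss in PyTorch (the batch shapes & margin value are assumptions):

import torch
import torch.nn.functional as F

def contrastive_loss(xi, xj, Y, m=1.0):
    # xi, xj: (batch, embedding_dim) encodings; Y: (batch,) with 0 = similar, 1 = dissimilar.
    d = F.pairwise_distance(xi, xj)                                   # Euclidean distance per pair.
    loss = (1 - Y) * d.pow(2) + Y * torch.clamp(m - d, min=0).pow(2)  # the formula above.
    return loss.mean()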

Twin networks are also used for Object Tracking, because of their unique two tandem inputs and similarity measurement.

- Prototypical Networks:

A prototype is the representative of a certain class. Imagine a scenario where we give you 10 images per class and there are 5 such classes. Prototypical networks come up with a representative embedding (a prototype) for each class by taking the average of the embeddings of each input/image belonging to that class.
Classification is then performed by simply finding the nearest class prototype in this representation space, for a given test sample.

In the preceding example illustration, there are three classes, and each circle represents the embedding of an image belonging to a class. Each star (prototype) is the average embedding across all the images (circles) belonging to that class.
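A minimal sketch of prototype computation & nearest-prototype classification (the class counts, embedding size, and random embeddings are placeholders):

import torch

support_embeddings = torch.randn(5, 10, 64)    # hypothetical: 5 classes x 10 images, each encoded to 64-d.
prototypes = support_embeddings.mean(dim=1)    # one 64-d prototype (star) per class.

query_embedding = torch.randn(64)              # encoding of a test image.
distances = (prototypes - query_embedding).norm(dim=1)  # Euclidean distance to each prototype.
print("predicted class:", distances.argmin().item())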

- Relation Networks:

A relation network is fairly similar to a Siamese network, except that the metric
we optimize for is not the L1 distance between embeddings but a relation score.

In the preceding diagram, the pictures on the left are the support set for
five classes and the dog image at the bottom is the query image:
(a) Pass both the support and query images through an embedding module,
which provides embeddings for the input image.
(b) Concatenate the feature maps of the support images with the feature maps
of the query image.
(c) Pass the concatenated features through a CNN module to predict the
relation score.
The class with the highest relation score is the predicted class of the query
image.
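A minimal sketch of steps (a)-(c) (the feature-map shapes and the relation module here are assumptions, not the exact published architecture):

import torch
import torch.nn as nn

support_feats = torch.randn(5, 64, 8, 8)   # hypothetical embedding-module outputs: one feature map per support class.
query_feat = torch.randn(64, 8, 8)         # feature map of the query image.

relation_module = nn.Sequential(           # small CNN that maps a concatenated pair to a relation score.
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 1), nn.Sigmoid())

# (b) concatenate the query's features with each support class's features along the channel dimension.
pairs = torch.cat([support_feats, query_feat.unsqueeze(0).expand(5, -1, -1, -1)], dim=1)  # (5, 128, 8, 8)
scores = relation_module(pairs).squeeze(1)  # (c) one relation score per class.
print("predicted class:", scores.argmax().item())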

(A) CLASSIFICATION:

1) Example Classifier code for training Cats & Dogs images on disk:

(training dataset taken from link: https://www.kaggle.com/datasets/tongpython/cat-and-dog)

- DataSet File code:

import torch
from torch.utils.data import Dataset
from torchvision import transforms
from matplotlib import pyplot as plt
import os
import cv2

# this class loads the data from the specified train/test folder for Cat-Dog-Classification, & prepares the input data (transforms).
class CatDogDataSet(Dataset):
    output_classes = []  # class variable - contains the output classes.
    training_set_str = "/training_set"
    test_set_str = "/test_set"

    # usually, this should contain just the list of the file names in the training (or validation) dataset.
    def __init__(self, path, isTraining) -> None:
        super().__init__()
        self.data = []
        self.labels = []  # this can be converted to a one-hot tensor just before training/validation, using torch.nn.functional.one_hot().
        # get the output classes from the training (or test) set. Exclude hidden folders.
        CatDogDataSet.output_classes = [folderStr for folderStr in os.listdir(path + CatDogDataSet.training_set_str) if not folderStr.startswith('.')]
        # convert to tensor (value range = (0, 1)), resize the shortest edge of the input image to size=128, then center-crop to make the image size (128, 128).
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Resize(128), transforms.CenterCrop(128)])

        # Here, the entire data is being loaded in __init__() itself.
        self.load_data(path, isTraining)
        #print(len(self.data), len(self.labels))

        # convert labels (a list of indices into output_classes) into a one-hot-encoded tensor of size (L, len(output_classes)), where L = len(self.labels).
        # keep this commented if labels are to be used as a 1D class-indices tensor. Also, comment the statement "labels = torch.argmax(labels, dim=1)" in the validate() function.
        #self.labels = torch.nn.functional.one_hot(torch.tensor(self.labels), len(CatDogDataSet.output_classes))
        #self.labels = self.labels.to(torch.float32)

    # returns the length of the total data in the training (or validation) set.
    def __len__(self):
        return len(self.data)

    # Returns the data at the specified index. Usually, data should be loaded into memory here (from the filename string at index) & then returned.
    def __getitem__(self, index):
        return self.data[index], self.labels[index]  # with one-hot encoding enabled, labels[index] is a tensor of shape (1, len(output_classes)).

    # returns the number of output classes. To be provided to the model class; called on the class itself.
    def get_number_of_output_classes():
        return len(CatDogDataSet.output_classes)

    # to skip hidden files in macOS.
    def should_skip_file(self, fileName):
        return fileName.startswith('.')

    # loads data into memory.
    def load_data(self, path, isTraining):
        if isTraining:
            sub_path = CatDogDataSet.training_set_str
        else:
            sub_path = CatDogDataSet.test_set_str
        path = path + sub_path

        # apply transforms here, after loading the data files.
        for label_index, label_folder in enumerate(CatDogDataSet.output_classes):
            label_folder_path = path + "/" + label_folder
            # loop through all the images in this class folder.
            for image_name in os.listdir(label_folder_path):
                if self.should_skip_file(image_name):
                    continue
                # load the image.
                img = cv2.imread(label_folder_path + '/' + image_name)
                img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
                img = self.preprocess(img)
                # append this transformed data & its label to the corresponding "data" & "labels" lists.
                self.data.append(img)
                self.labels.append(label_index)  # store the label index (0-based). This can be converted to one-hot encoding later, just before training/validation.
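For reference, a minimal illustration of the torch.nn.functional.one_hot() conversion mentioned in the comments above:

import torch
import torch.nn.functional as F

labels = torch.tensor([0, 1, 1, 0])                # class indices (e.g. 0 = cats, 1 = dogs).
one_hot = F.one_hot(labels, num_classes=2).float()
# tensor([[1., 0.],
#         [0., 1.],
#         [0., 1.],
#         [1., 0.]])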

- Train/Validate File Code:

import torch
from torch.utils.data import DataLoader
from torch.nn import Module
import torch.optim as optim
from CatDogDataset import CatDogDataSet

def train(epochs, dl:DataLoader, model:Module, lossFn, optimizerFn:optim):
    model.train()
    for epoch in range(epochs):
        for _, [trainData, labels] in enumerate(dl):
            predictions = model(trainData)  # forward pass.

            # converting predictions from one-hot to indices using argmax() & passing that to the loss function does not work - see the explanation at the beginning of main.py.
            #predictions = torch.argmax(predictions, 1)  ###

            loss = lossFn(predictions, labels)  # compute loss.
            optimizerFn.zero_grad()
            loss.backward()     # backward pass.
            optimizerFn.step()  # update weights.

        print(f"Epoch: {epoch} - loss = {loss.item()}")

def validate(dl:DataLoader, model:Module):
    model.eval()
    total = 0
    correct_predictions = 0
    with torch.no_grad():
        for _, [test_data, labels] in enumerate(dl):
            outputs = model(test_data)
            _, predictions = torch.max(outputs, dim=1)  # get the indices of the largest values per row.
            # convert predictions (shape [batch_size]) to shape (batch_size, output_classes) using one-hot.
            #predictions = torch.nn.functional.one_hot(predictions, len(CatDogDataSet.output_classes))  ###
            #print(labels.shape, predictions.shape)

            # convert labels from one-hot to class indices. For validation, using class indices is better, for ease of computation, i.e. (predictions == labels).sum().
            #labels = torch.argmax(labels, dim=1)

            accuracy_ratio, current_correct_predictions = compute_accuracy(predictions, labels)
            print(f"Accuracy per batch = {accuracy_ratio}")
            total = total + labels.shape[0]  # maintain the total number of samples seen so far.
            # maintain the correct predictions so far.
            correct_predictions = correct_predictions + current_correct_predictions
    total_accuracy_ratio = correct_predictions / total
    print("\n")
    print(f"Total Accuracy: {total_accuracy_ratio}")
    print("\n")

def compute_accuracy(predictions, labels):
    total = labels.shape[0]  # basically batch_size (rows).
    # predictions & labels are 2D tensors when using one-hot encoding, so we would need to divide by the number of output classes & use the int() value.
    #correct_predictions = (predictions == labels).sum() / CatDogDataSet.get_number_of_output_classes()  ###
    correct_predictions = (predictions == labels).sum()
    accuracy_ratio = correct_predictions.int().item() / total
    return accuracy_ratio, correct_predictions.item()

- Classifier Model code:

import torch
import torch.nn as nn
import torchvision

# model_arch1: conv2D {1,2,3} = Validation Accuracy ~ 0.7716
# model_arch2: Validation accuracy ~ 0.8709.
# model_arch3: Validation accuracy ~ 0.8947.

class DogCatClassifierModel(nn.Module):
    #model_name = "cat_dog_classifier_model.pth"  # the model is saved with this name.
    #model_name = "cat_dog_classifier_model_arch2.pth"
    model_name = "cat_dog_classifier_model_arch3.pth"

    def __init__(self, output_classes) -> None:
        super().__init__()
        self.training = True  # flag that needs to be set for training/eval from outside.

        # model architecture.
        # input image size (after applying transforms) = 128 x 128 pixels.
        #self.model = self.model_arch1(output_classes)
        #self.model = self.model_arch2(output_classes)
        self.model = self.model_arch3(output_classes)

    def forward(self, inputs):
        if self.training:
            self.model.train()
        else:
            self.model.eval()

        """ # for debugging the shapes of individual layers.
        op1 = nn.Conv2d(3, 16, 3)
        outputs = op1(inputs)
        print(outputs.shape)
        op2 = nn.MaxPool2d(2)
        outputs = op2(outputs)
        print(outputs.shape)

        op3 = nn.Conv2d(16, 32, kernel_size=3)
        outputs = op3(outputs)
        print(outputs.shape)
        op4 = nn.MaxPool2d(2)
        outputs = op4(outputs)
        print(outputs.shape)
        """

        out = self.model(inputs)
        return out

    # self-designed basic model architecture (gives validation accuracy ~ 0.7716).
    def model_arch1(self, output_classes):
        return nn.Sequential(
            # conv2D 1.
            nn.Conv2d(3, 16, kernel_size=3),   # [3, 128, 128] -> [16, 126, 126] - kernel_size=3 reduces the input's width & height by 2 pixels (1 on each side). Input channels = 3, output channels = 16.
            nn.MaxPool2d(2),                   # [16, 126, 126] -> [16, 63, 63] (126/2=63), maxpool2D kernel size = 2.
            nn.ReLU(),
            # conv2D 2.
            nn.Conv2d(16, 32, kernel_size=3),  # [16, 63, 63] -> [32, 61, 61]. Input channels = 16, output channels = 32. Due to kernel_size=3, reduces by 2 pixels (63-2=61).
            nn.MaxPool2d(2),                   # [32, 61, 61] -> [32, 30, 30] (61/2=30)
            nn.ReLU(),
            # conv2D 3.
            nn.Conv2d(32, 8, kernel_size=3),   # [32, 30, 30] -> [8, 28, 28]. Input channels = 32, output channels = 8. Due to kernel_size=3, reduces by 2 pixels (30-2=28).
            nn.MaxPool2d(2),                   # [8, 28, 28] -> [8, 14, 14] (28/2=14)
            nn.ReLU(),

            nn.Flatten(),                      # [-1, 1568]. (8*14*14 = 1568)
            nn.Linear(1568, 80),               # reduce the next output from 1568 to 80.
            nn.ReLU(),
            nn.Linear(80, output_classes),     # reduce the final output to the number of classes, i.e. 2 (cats & dogs).
        )

    # model arch with batch normalization (gives validation accuracy ~ 0.8709).
    def model_arch2(self, output_classes):
        # we can chain nn.Sequential() inside nn.Sequential() with as much depth as we wish.
        return nn.Sequential(
            self.conv_layer(3, 16, 3),    # [3, 128, 128] -> [16, 63, 63] ((128-2)/2=63), kernel size=3.
            self.conv_layer(16, 64, 3),   # [16, 63, 63] -> [64, 30, 30] ((63-2)/2=30)
            self.conv_layer(64, 128, 3),  # [64, 30, 30] -> [128, 14, 14] ((30-2)/2=14)
            self.conv_layer(128, 32, 3),  # [128, 14, 14] -> [32, 6, 6] ((14-2)/2=6)
            self.conv_layer(32, 16, 3),   # [32, 6, 6] -> [16, 2, 2] ((6-2)/2=2)
            nn.Flatten(),                 # make 1D (16 * 2 * 2 = 64)
            nn.Linear(64, 50),
            nn.ReLU(),
            nn.Linear(50, output_classes)
        )

    # model arch with batch normalization (gives validation accuracy ~ 0.8947).
    def model_arch3(self, output_classes):
        # the difference from model_arch2() is that the number of channels keeps increasing till the end.
        return nn.Sequential(
            self.conv_layer(3, 16, 3),     # [3, 128, 128] -> [16, 63, 63] ((128-2)/2=63), kernel size=3.
            self.conv_layer(16, 64, 3),    # [16, 63, 63] -> [64, 30, 30] ((63-2)/2=30)
            self.conv_layer(64, 128, 3),   # [64, 30, 30] -> [128, 14, 14] ((30-2)/2=14)
            self.conv_layer(128, 256, 3),  # [128, 14, 14] -> [256, 6, 6] ((14-2)/2=6)
            self.conv_layer(256, 512, 3),  # [256, 6, 6] -> [512, 2, 2] ((6-2)/2=2)
            nn.Flatten(),                  # make 1D (512 * 2 * 2 = 2048)
            nn.Linear(2048, 50),
            nn.ReLU(),
            nn.Linear(50, output_classes)
        )

    # creates blocks that can be reused when building the model layers in an architecture.
    # n_i = input channels, n_o = output channels.
    def conv_layer(self, n_i, n_o, kernel_size):
        return nn.Sequential(
            nn.Conv2d(n_i, n_o, kernel_size),  # width & height (-2)
            nn.BatchNorm2d(n_o),
            nn.ReLU(),
            nn.MaxPool2d(2)                    # width & height (/2)
        )

- Main.py code:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn import Module
import torch.optim as optim
from CatDogDataset import CatDogDataSet
from DogCatClassifierModel import DogCatClassifierModel
from train_validate import train, validate
import os

# ### in code -> using a 1D tensor instead of one-hot encoding. Using class indices instead of one-hot encoding (when using CrossEntropyLoss) doesn't work on predictions, as torch.argmax() & related functions are not differentiable (no gradients), hence the loss cannot be backpropagated.

# See the note in the CrossEntropyLoss() docs - the performance of this criterion is generally better when the target (i.e. labels/ground truth) contains class indices, as this allows for optimised computation. Consider providing the target as class probabilities only when a single class label per minibatch item is too restrictive. So labels can contain class indices (a 1D tensor of values 0 to (C-1) for C classes), but predictions should contain scores for each class (& not a 1D tensor of class indices), so that learning can occur (see the sketch below).
# link: https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss

# to use labels as a 1D vector of class indices, the last 2 lines of CatDogDataSet's __init__() function were commented out. When labels were used as class indices in model_arch3(), validation accuracy was ~ 0.8843.
path = os.getcwd() + "/dataset" # path to dataset.

# trains the model on the training set.
def run_training():
    CDDataset = CatDogDataSet(path, True)  # load data into the Dataset from storage.
    dLoader = DataLoader(CDDataset, batch_size=100, shuffle=True)  # link the DataLoader with the Dataset.
    model = DogCatClassifierModel(CatDogDataSet.get_number_of_output_classes())  # create the model.
    model.training = True
    lossFn = nn.CrossEntropyLoss()  # create the loss.
    optimFn = optim.Adam(model.parameters(), lr=0.001)  # create the optimizer.
    epochs = 20
    print("------------Training Started------------")
    train(epochs, dLoader, model, lossFn, optimFn)  # train the model with the dataset.
    print("------------Training Completed------------")
    torch.save(model.state_dict(), DogCatClassifierModel.model_name)  # save the model weights to disk. It is good practice to explicitly move the model to device="cpu" before storing.

# runs validation on the validation set, using a model trained via run_training().
def run_validation():
    CDDataset = CatDogDataSet(path, False)  # prepare the validation data.
    dLoader = DataLoader(CDDataset, batch_size=200, shuffle=False)
    model = DogCatClassifierModel(CatDogDataSet.get_number_of_output_classes())
    model.training = False
    result = model.load_state_dict(torch.load(DogCatClassifierModel.model_name))  # load the model weights from disk into the model variable.
    print("----------Validation Started--------------")
    validate(dLoader, model)
    print("----------Validation Complete--------------")
    print("\n")

#run_training()   # uncomment this, to run training.
run_validation()  # uncomment this, to run validation.

Transfer Learning:

Transfer learning is a technique where knowledge gained from one task is leveraged to solve another, similar task.
Typically, the pre-trained models used to perform transfer learning
are trained on millions of images (which are generic and not the dataset of
interest to us) and those pre-trained models are now fine-tuned to our dataset of
interest.
The various filters (kernels) of the model would activate for a wide variety of shapes, colours, and textures within the images. Those filters can now be reused to learn features on a new set of images. After the features are learned, they can be connected to a hidden layer prior to the final classification layer, to customize the model to the new data.

Steps for Transfer learning:

1) Normalize the input images by the same mean and standard deviation that was used during the training of the pre-trained model.
2) Fetch the pre-trained model's architecture & its trained weights.
3) Discard the last few layers of the pre-trained model.
4) Connect the truncated pre-trained model to a freshly initialized layer
(or layers) where weights are randomly initialized. Ensure that the output of the
last layer has as many neurons as the classes/outputs we would want to predict.
5) Ensure that the weights of the pre-trained model are not trainable (in
other words, frozen/not updated during backpropagation, by setting
requires_grad = False), but that the weights of the newly initialized layer and the
weights connecting it to the output layer are trainable (we do not train the
weights of the pre-trained model, as we assume those weights are already well
learned for the task, and hence leverage the learning from a large model. In
summary, we only learn the newly initialized layers for our small dataset).
6) Train the trainable parameters of the model via usual training
techniques.

Tips:

- Learning rate during fine-tuning: It is preferable to keep the LR much lower when fine-tuning, compared to what the LR was during the initial (pre-fine-tuning) training (say 100 to 1000 times less, e.g. 3e-8). This helps to retain the already-learnt functionality, while adjusting only as much as is needed for the new data.

- Freezing & activation functions: We never freeze activation functions


like ReLU, etc, as they do not have a “required_grad” attribute, hence, such
operations are not applicable here.

- Differential learning rates: This means having different learning rates for different parts of the network during training (also called discriminative LR). The idea is to divide the layers into various layer groups and set a different learning rate for each group, so that we get ideal results.
Ex: It is usually desirable to have a lower LR for earlier layers (to prevent overfitting to specific examples/features) & comparatively higher LRs for layers at greater depth (i.e. where higher-level features are learnt; to speed up convergence or fine-tune their learning).
This allows for finer control over how learning occurs.

Example Code:
optimizer = optim.AdamW([{'params': model.module1.parameters(), 'lr': 3e-4},
                         {'params': model.module2.parameters(), 'lr': 5e-7}])
The above code assigns different learning rates (3e-4 & 5e-7) to different modules of the model, i.e. module1 & module2, as a list of dictionaries.

(A) Example: Transfer learning by modifying VGG16 architecture:

VGG (Visual Geometry Group) is the architecture name, while 16 denotes the number of layers used in this architecture. Other variants exist, such as VGG11, VGG19, etc. This architecture was the runner-up in the ImageNet competition.

Code snippets for Cat-Dog Dataset:

# The models module in the torchvision package hosts the various pre-trained models available in PyTorch.
from torchvision import models
device = 'cuda' if torch.cuda.is_available() else 'cpu'
from torchsummary import summary

# Load the VGG16 (pretrained) model.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).to(device)
summary(model, torch.zeros(1, 3, 224, 224))  # get the model summary/architecture.
print(model)  # prints another form of summary (shows the grouping under the features, avgpool & classifier sections) of the model.

# The above printing of "model" reveals that VGG has 3 main modules: features, avgpool & classifier. We usually freeze features & avgpool, training a new classifier module only. Delete the classifier module (or only a few layers at the bottom) and create a new one in its place that will predict the required number of classes corresponding to our "cats-dogs" dataset (instead of the existing 1,000).

# (1) transform the training images similarly to the ones used when training VGG initially.
img = cv2.resize(img, (224, 224))  # can use transforms.Resize() & .CenterCrop() too, while loading input images.
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# NOTE: While leveraging pre-trained models, it is mandatory to resize, permute, and then normalize images.

# Freeze (all) model parameters.
for param in model.parameters():
    param.requires_grad = False

# REPLACING MODULES OF THE VGG16 PRETRAINED NETWORK:

# (A) Replace the avgpool module to return a feature map of size 1 x 1.
model.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))

# (B) Define (replace with) the new classifier module.
model.classifier = nn.Sequential(nn.Flatten(), nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2), nn.Linear(128, 1), nn.Sigmoid())

# Define the loss function according to our cat-dogs requirement, i.e. binary loss (can continue using CrossEntropyLoss too).
loss_fn = nn.BCELoss()

(B) Transfer learning by modifying Resnet18 Architecture:

ResNet (Residual Networks) solves the problem of vanishing gradients by making the raw output of one layer (say, an initial layer) available to another (say, the final) layer. Because the intermediate layers are skipped, the backward gradients flow freely to the initial layers with little modification (& hence this helps in increasing model depth).
The term residual in the residual network is the additional information that
the model is expected to learn from the previous layer that needs to be passed
on to the next layer.

While so far, we have been interested in extracting the F(x) value, where x is the
value coming from the previous layer, in the case of a residual network, we are
extracting not only the value after passing through the weight layers, which is
F(x), but are also summing up F(x) with the original value, which is x.
So far, we have been using standard layers that perform either linear
or convolution transformations F(x) along with some non-linear activation. Both of
these operations in some sense destroy the input information. For the first time,
we are seeing a layer that not only transforms the input, but also preserves it, by
adding the input directly to the transformation – F(x) + x.

ResNet18 (18 layers) architecture:

The skip connections are made after every 2 layers.

Code snippets for Cat-Dog Dataset:

# Define a class for a LAYER with a convolution operation.
class ResLayer(nn.Module):
    def __init__(self, ni, no, kernel_size, stride=1):
        super(ResLayer, self).__init__()
        padding = kernel_size - 2  # the image size should stay the same after convolution (this formula holds for kernel_size=3).
        self.conv = nn.Sequential(nn.Conv2d(ni, no, kernel_size, stride, padding=padding), nn.ReLU())

    # add the previous layer's input (inputs) to the output of the current processing (self.conv) layer.
    def forward(self, inputs):
        outputs = self.conv(inputs) + inputs
        return outputs

# Define a class for the ResNet MODEL (DGTransferResNet18Model) & load pretrained weights.
def __init__(self, output_classes) -> None:
    self.transfer_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Inspecting the model - it contains these sections: convolution, batch normalization, ReLU, MaxPooling, 4 layers of ResNet blocks, avgpool, and a fully connected layer (fc).
    #print(self.transfer_model)

    # Freeze all sections/modules of our ResNet class "DGTransferResNet18Model", except the "avgpool" & "fc" modules. Most of the code is similar to our VGG16 model described earlier.
    self.transfer_model.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
    self.transfer_model.fc = nn.Sequential(nn.Flatten(), nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2), nn.Linear(128, output_classes))

Besides VGG and ResNet, some of the other prominent pre-trained models are Inception, MobileNet, DenseNet, and SqueezeNet.

2) Example code for (Multi-Regression) Facial Keypoint Detection:

Multi-regression means prediction of multiple (continuous) values from an input. For regression, we use mean absolute error (MAE) (always non-negative; does not penalize large errors as much as MSE; also known as L1 loss), mean squared error (MSE) (more sensitive to outliers than MAE; always positive; penalizes large errors more; L2 loss), or root mean squared error (RMSE) (on the same scale as the target, unlike MSE, but not as robust to outliers as MAE) as the loss function.
When the dataset is clear of outliers/noise, MSE & RMSE can be the better options.
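A quick numeric illustration of the three losses on the same pair of errors:

import torch
import torch.nn as nn

pred = torch.tensor([2.0, 4.0])
target = torch.tensor([3.0, 1.0])
mae = nn.L1Loss()(pred, target)                # (1 + 3) / 2 = 2.0
mse = nn.MSELoss()(pred, target)               # (1 + 9) / 2 = 5.0 - the larger error dominates.
rmse = torch.sqrt(nn.MSELoss()(pred, target))  # sqrt(5.0) ~ 2.24 - back on the target's scale.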

Facial keypoints denote the markings of various keypoints on an image that contains a face.
Problems to be solved for the keypoints task:
(a) All the input images (with different sizes) need to be resized to the same size. Hence, the labeled keypoints also need to be adjusted accordingly. The solution is to represent each keypoint as a value between 0 & 1, where 0 & 1 represent the start & end of the image along that axis, i.e. keypoint locations become relative to the image's dimensions (see the sketch below). Since the values are between 0 and 1, we can use a sigmoid activation at the final layer to get the outputs.
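A minimal sketch of this normalization & of scaling back for display (the sizes & the keypoint are placeholders):

import torch

h, w = 250, 200                            # original image height & width (assumed).
kp = torch.tensor([100.0, 50.0])           # one keypoint (x, y) in pixel coordinates.
kp_norm = kp / torch.tensor([w, h])        # normalized to [0, 1] -> tensor([0.5000, 0.2000])
H, W = 224, 224                            # the resized (network input) size.
kp_back = kp_norm * torch.tensor([W, H])   # scaled back to pixel coordinates of the resized image.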

The dataset is downloaded from the link:
https://github.com/udacity/P1_Facial_Keypoints.git (the /data/ folder)

Code snippets for Facial Keypoints:

Dataset:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import pandas as pd
import os
import cv2

# set device type.
device = "cuda" if torch.cuda.is_available() else "cpu"

# The "FaceDataset" class loads the image file names & their raw keypoint data at init. During __getitem__(), it returns the loaded image & its normalized keypoint data in a 1D tensor.

# Steps:
# __init__(): get a list of (<image file name>, <keypoints for that image>). Keypoints are in the form <x1, y1, x2, y2, ..., xN, yN>.
# __getitem__(): load the image into memory. Normalize the keypoint locations according to the original image's dimensions (i.e. between 0 and 1). Return (<loaded image>, <keypoints>). Returned keypoints are in the form <x1, x2, ..., xN, y1, y2, ..., yN>.

class FaceDataset(Dataset):
    # class variable, initialized during import of this class itself. Sets up preprocessing for future use in __getitem__().
    preprocess = transforms.Compose([transforms.ToTensor(), transforms.Resize(224), transforms.CenterCrop(224), transforms.ConvertImageDtype(torch.float32), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

    def __init__(self, input_data_path, label_filePath) -> None:
        super().__init__()
        self.data = []  # contains pairs [(file_name, label tensor), ...].
        self.path = input_data_path
        # read label data from the provided csv path "label_filePath".
        self.label_data = pd.read_csv(label_filePath)
        self.initialize_data()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        file_name, label_data = self.data[index]
        image = cv2.imread(self.path + "/" + file_name)
        # get the image's original size, for preprocessing the label keypoints.
        image_shape = image.shape  # [H, W, C]
        # dividing by 255 to normalize the image into the 0-1 range doesn't seem to make any difference to the VGG model's performance. Also, cv2 images, when converted to a tensor using transforms.ToTensor(), don't seem to have their pixel values converted to the [0-1] range.
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  #/255  # image dtype=float64 if using /255.
        # No need to rearrange the image to [C x H x W] via torch.permute(), as transforms.ToTensor() automatically rearranges the shape so that the image is [C x H x W].
        image = FaceDataset.preprocess(image)
        # also preprocess the ground truth, i.e. label data.
        label_data = self.preprocess_label(label_data, image_shape)
        return (image.to(device), label_data.to(device))  # shift to the appropriate device.

    # preprocess label data (ground truth), i.e. normalize the keypoint locations between 0 and 1 along the image dimensions (from image_shape). Length of label data = 136.
    # even indices hold the X axis values of each of the 68 (136/2=68) keypoints, while odd indices hold the Y axis values.
    def preprocess_label(self, label_data, image_shape):
        # image_shape -> [H, W, C] -> H=0, W=1.
        keypoint_data_x = label_data[0::2] / image_shape[1]  # X axis represents width. Returns elements 0, 2, 4, ...
        keypoint_data_y = label_data[1::2] / image_shape[0]  # Y axis represents height. Returns elements 1, 3, 5, ...
        # use torch.cat() to keep this tensor 1D, as the model output will be 1D, making the loss easier to compute. Length of label data = 136. (torch.stack() would instead give a 2D tensor of 2 rows & 68 columns.)
        label_data = torch.cat((keypoint_data_x, keypoint_data_y), dim=0)
        return label_data

    # to skip hidden files in macOS.
    def should_skip_file(self, fileName):
        return fileName.startswith('.')

    # extracts the label for the provided file name. Assumes that the 1st field in the csv contains this file name.
    def extract_label_for_filename(self, file_name):
        # extract the required label data from "self.label_data". Here the key (image name) is in column 0.
        label_data = self.label_data.loc[self.label_data.iloc[:, 0] == file_name]
        if label_data.empty:
            return None  # the "test" folder may contain images that have no corresponding entries in "test_frames_keypoints.csv".
        # remove the first column that contains the image name.
        label_data = label_data.iloc[:, 1:]
        # convert the (1-row) pandas dataframe to a tensor.
        label_data = torch.tensor(label_data.iloc[0].values).float()
        return label_data

    # "self.path" contains the complete path, up to the "training" or "test" subfolder.
    def initialize_data(self):
        for image_name in os.listdir(self.path):
            if self.should_skip_file(image_name):
                continue
            # store the image file name, along with its corresponding label from the csv file, in self.data.
            label_data = self.extract_label_for_filename(image_name)
            if label_data is not None:  # ignore images that have no corresponding keypoints in the csv (label_data is None in such cases).
                self.data.append((image_name, label_data))

    # static method - loads the input image & also returns the preprocessed image, for input to model prediction.
    def get_input_processed_image(image_path):
        image = cv2.imread(image_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  #/255
        height, width, _ = image.shape
        preprocessed_image = FaceDataset.preprocess(image)
        # "image" is not a tensor.
        return image, preprocessed_image.to(device), height, width

Model:
import torch
import torch.nn as nn
import torchvision
from torchvision import models
import cv2

class FaceKPModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.model_name = "trained_models/FaceKP_VGG16.pth"
        self.model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        for param in self.model.parameters():
            param.requires_grad = False  # freeze weights.
        #self.model_arch1()
        #self.model_arch2()
        self.model_arch3()
        #self.model_arch4()

    def forward(self, input):
        outputs = self.model(input)
        return outputs

    # validation loss (20 epochs) ~
    def model_arch1(self):
        # modify layers for learning. Added an "nn.ReLU()" before Flatten in the avgpool below (not present in the code from the ebook "Modern Computer Vision with PyTorch"); it decreased the loss from 0.02 to 0.01 within 10 epochs.
        self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, 3), nn.MaxPool2d(2), nn.ReLU(), nn.Flatten())
        # the output size of VGG16's features module (see summary() of the model) is 512*7*7 - its Conv2Ds are padded, so the image size does not reduce by 2. The next module, "avgpool", which we override, has 1 Conv2D with no padding, so the size reduces by 2 (7-2=5), followed by a MaxPool2d(2), which results in division by 2 (5/2=2). On flattening, it becomes 2048 (512 * 2 * 2 = 2048).

        # this classifier module has 2 linear layers. Its last layer is a sigmoid, which returns values between 0 & 1 (our keypoints are normalized between 0 and 1). The last Linear layer has 136 outputs, equal to the number of keypoint coordinates to be predicted.
        self.model.classifier = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 136), nn.Sigmoid())

    # validation loss (20 epochs) ~ 0.013
    def model_arch2(self):
        self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, 3), nn.MaxPool2d(2), nn.ReLU(), nn.Flatten())
        # this classifier has less Dropout(), resulting in a slightly smaller loss.
        self.model.classifier = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.1), nn.Linear(512, 136), nn.Sigmoid())

    # Training loss ~ 0.00012, validation loss (20 epochs) ~ 0.0127
    def model_arch3(self):
        # Added BatchNorm2d().
        self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, 3), nn.MaxPool2d(2), nn.BatchNorm2d(512), nn.ReLU(), nn.Flatten())
        # this classifier module has 5 (linear) layers, with no dropout.
        # try adding batch normalization.
        self.model.classifier = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 300), nn.ReLU(), nn.Linear(300, 200), nn.ReLU(), nn.Linear(200, 136), nn.Sigmoid())

    # validation loss (20 epochs) ~
    def model_arch4(self):
        # Added BatchNorm2d().
        self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, 3), nn.MaxPool2d(2), nn.BatchNorm2d(512), nn.ReLU(), nn.Flatten())
        # increased the number of units in the linear layers.
        self.model.classifier = nn.Sequential(nn.Linear(2048, 4096), nn.ReLU(), nn.Linear(4096, 1024), nn.ReLU(), nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Linear(512, 200), nn.ReLU(), nn.Linear(200, 136), nn.Sigmoid())

Train/Validate:
import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from FaceDataset import FaceDataset
import cv2
#from torch.utils.tensorboard import SummaryWriter

device = "cuda" if torch.cuda.is_available() else "cpu"

def train(epochs, dLoader:DataLoader, model:nn.Module, lossFn, optimizer:optim):
    #ts = SummaryWriter()  # SummaryWriter for tensorboard.
    total_loss = 0
    model.train()

    for epoch in range(epochs):
        for _, [inputs, labels] in enumerate(dLoader):
            predictions = model(inputs)  # (1) forward pass.
            loss = lossFn(predictions.to(device), labels.to(device))  # (2) compute loss.
            optimizer.zero_grad()  # (3) zero out the gradients.
            loss.backward()        # (4) backward pass - compute gradients.
            optimizer.step()       # (5) update weights.

            total_loss += loss.item()

        print(f"Epoch {epoch}: Loss: {loss.item()}")
        #ts.add_scalar("Loss", total_loss, epoch)  # to view: "tensorboard --logdir runs"

    #ts.close()  # close the tensorboard summary writer.

def validate(dLoader:DataLoader, model:nn.Module, lossFn):
    model.eval()
    total_accuracy = 0
    total = 0  # holds the number of times the below loop runs.

    with torch.no_grad():
        for _, [inputs, labels] in enumerate(dLoader):
            predictions = model(inputs)
            accuracy_ratio = compute_accuracy(predictions, labels, lossFn)
            total += 1  # total is just the number of times this loop runs. As accuracy is a ratio, total accuracy = (total_accuracy / total).
            total_accuracy += accuracy_ratio
            print(f"Validation loss per batch: {accuracy_ratio}")

    print("\n")
    print(f"Total Validation loss: {total_accuracy / total}")
    print("\n")

# this computes the "accuracy" given predictions & labels. Here, accuracy is nothing but the computed loss.
def compute_accuracy(predictions, labels, lossFn):
    loss = lossFn(predictions, labels)
    return loss.item()

import pandas as pd

# plots predicted keypoints on the provided image (loaded via get_input_processed_image()).
def plot_keypoints_on_image(model, input, image, height, width):
    input = torch.unsqueeze(input, 0)  # add a "batch" dimension to the (single) input image.
    output = model(input)
    output = torch.squeeze(output, 0).to(device)  # output dim = [1, 136]; make it [136]. The "1" is the batch dimension; it is not a problem during training/validation, as both predictions & labels are of dim (<batch_size>, 136).
    # multiply X values by the width & Y values by the height, to scale the predicted keypoints w.r.t. the actual image size.
    offset = 68  # number of keypoints.
    # this also works for scaling, instead of the loop below:
    #output[:offset] = output[:offset] * width
    #output[offset:] = output[offset:] * height

    # temp - make the predictions' format the same as the labels' (x1, y1, x2, y2, ...), in "temp_prediction".
    """
    i = 0
    temp_prediction = []
    while i < offset:
        temp_prediction.append(output[i].item())
        temp_prediction.append(output[i+offset].item())
        i += 1
    temp_prediction = torch.tensor(temp_prediction)
    # scale "temp_prediction" according to the image dimensions, in "temp_prediction_scaled".
    temp_prediction_scaled = torch.zeros(136)
    for i in range(0, offset*2, 2):
        temp_prediction_scaled[i] = temp_prediction[i] * width
        temp_prediction_scaled[i+1] = temp_prediction[i+1] * height
    # "temp_prediction" can be compared with "o1" (the actual keypoints from the csv) while debugging in the data viewer.
    """

    # plot predictions in green.
    for i in range(offset):
        output[i] = output[i] * width                 # scale X values.
        output[i+offset] = output[i+offset] * height  # scale Y values.
        # plot these scaled (actual-size) keypoints on the image.
        image = cv2.circle(image, (int(output[i]), int(output[i+offset])), 2, (0, 255, 0), 2)

    # plot actual keypoints in red (for debugging/comparison).
    """
    df = pd.read_csv("../dataset/training_frames_keypoints.csv")
    o1 = df.loc[df.iloc[:, 0] == "Adrian_Nastase_42.jpg"]
    o1 = o1.iloc[:, 1:]
    o1 = torch.tensor(o1.iloc[0].values).float()
    for i in range(0, offset*2, 2):  # loop over all 136 elements.
        # Note that the actual keypoints are in (x1, y1, x2, y2, ...) format.
        x_kp = o1[i]
        y_kp = o1[i+1]
        image = cv2.circle(image, (int(x_kp), int(y_kp)), 2, (0, 0, 255), 2)
    """

    print("\nPlotting of keypoints completed.\n")
    cv2.imshow("window1", image)
    cv2.waitKey(0)  # show the image in a window & wait for the user to press any key to exit.

Main.py:
import torch
import torch.nn as nn
from FaceDataset import FaceDataset
from torch.utils.data import DataLoader
from FaceKeypointModel import FaceKPModel
import torch.optim as optim
from train_validate import train, validate, plot_keypoints_on_image

device = "cuda" if torch.cuda.is_available() else "cpu"

def run_training():
    training_path = "../dataset/training"
    label_file_path = "../dataset/training_frames_keypoints.csv"
    faceKPdataset = FaceDataset(training_path, label_file_path)
    dLoader = DataLoader(faceKPdataset, batch_size=100, shuffle=True)
    model = FaceKPModel().to(device)
    #lossFn = nn.L1Loss()  # L1Loss = Mean Absolute Error (MAE).
    lossFn = nn.MSELoss()
    optimizerFn = optim.Adam(model.parameters(), lr=0.0005)
    epochs = 20
    # start training.
    print("------------Training Started-------------")
    train(epochs, dLoader, model, lossFn, optimizerFn)
    print("------------Training Complete-------------")
    torch.save(model.state_dict(), model.model_name)
    print(f"Model saved to {model.model_name}")
    print("\n")

# loads an already-saved model from "model_path" & resumes training.
def resume_training(model_path, epochs=None):
    training_path = "../dataset/training"
    label_file_path = "../dataset/training_frames_keypoints.csv"
    faceKPdataset = FaceDataset(training_path, label_file_path)
    dLoader = DataLoader(faceKPdataset, batch_size=100, shuffle=True)
    model = FaceKPModel().to(device)
    # usually, model_path = model.model_name.
    result = model.load_state_dict(torch.load(model_path))  # (1) load the saved model.
    lossFn = nn.MSELoss()
    optimizerFn = optim.Adam(model.parameters(), lr=0.0005)
    if epochs is None:
        epochs = 20  # (2) set a default epoch value, if not provided.
    # start training.
    print("------------Training Resumed-------------")
    train(epochs, dLoader, model, lossFn, optimizerFn)
    print("------------Training Complete-------------")
    torch.save(model.state_dict(), model.model_name)
    print(f"Model saved to {model.model_name}")
    print("\n")

def run_validation():
    test_path = "../dataset/test"
    label_file_path = "../dataset/test_frames_keypoints.csv"
    faceKPdataset = FaceDataset(test_path, label_file_path)
    dLoader = DataLoader(faceKPdataset, batch_size=100, shuffle=False)
    model = FaceKPModel().to(device)
    result = model.load_state_dict(torch.load(model.model_name))
    lossFn = nn.L1Loss()
    print("------------Validation Started-------------")
    validate(dLoader, model, lossFn)
    print("------------Validation Complete-------------")

# shows keypoints on a test image, using the model's outputs.
def test_faceKP_on_image():
    #image_path = "../dataset/sample_test/indian_man1.png"
    #image_path = "../dataset/sample_test/indian_woman1.png"
    image_path = "../dataset/training/Adrian_Nastase_42.jpg"
    image, preprocessed_image, height, width = FaceDataset.get_input_processed_image(image_path)
    model = FaceKPModel().to(device)
    result = model.load_state_dict(torch.load(model.model_name))
    plot_keypoints_on_image(model, preprocessed_image, image, height, width)

#run_training()
#resume_training("trained_models/FaceKP_VGG16.pth")
#run_validation()
test_faceKP_on_image()

- torch_snippets library:

Much of the code in model training is common & repetitive and has to be rewritten every time. The torch_snippets library provides one-line functions for such common tasks, shortening development time & providing convenience.
Many operations, for example reading an image, showing an image, the entire training loop, etc., are repetitive & can be reused via single function calls. Moreover, subtle things are taken care of by the torch_snippets library, such as, when reading images using cv2, internally converting them into [C x H x W] as required by PyTorch, & so on.
torch_snippets can be installed using pip: pip install torch-snippets.
Additional dependencies:
pip install fitz
pip install PyMuPDF

Import in code using: from torch_snippets import *
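A sketch of typical usage (the file name is a placeholder; read() & show() are the helpers used throughout the same ebook):

from torch_snippets import *

im = read('sample.jpg', 1)  # read an image from disk (1 -> return it in RGB channel order).
show(im)                    # display the image without worrying about channel ordering.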

3) Example code for (Multi-Task) Age Estimation & Gender Classification:

We are predicting 2 attributes, continuous (age) and categorical (gender), in a single forward pass.
In the last (overriding) part of the model, create two separate layers branching out from the preceding layer, where one layer corresponds to age estimation and the other to gender classification.
We have different loss functions for each branch of output, as age is a continuous value (requiring an MSE or MAE loss calculation) and gender is a categorical value (requiring a cross-entropy loss calculation, here binary cross-entropy).
Take a weighted summation of the age estimation loss and the gender classification loss, as sketched below.
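A minimal sketch of such a weighted summation (the 0.7/0.3 weights and dummy tensors are assumptions; the example code later in this section uses a plain unweighted sum):

import torch
import torch.nn as nn

# hypothetical predictions & targets for a batch of 8.
age_pred = torch.rand(8, requires_grad=True)
age_target = torch.rand(8)
gender_pred = torch.rand(8, requires_grad=True)
gender_target = torch.randint(0, 2, (8,)).float()

age_loss = nn.MSELoss()(age_pred, age_target)           # continuous output -> MSE.
gender_loss = nn.BCELoss()(gender_pred, gender_target)  # categorical output -> BCE.
total_loss = 0.7 * age_loss + 0.3 * gender_loss         # weighted summation of the two losses.
total_loss.backward()                                   # one backward pass covers both tasks.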

Database link:
https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/blob/master/Chapter05/age_gender_prediction.ipynb
(Open the link in Google Colab & download the database from the code - getFile_from_drive() section).

Multi-Task Prediction Model:

- We use a pretrained VGG model, overriding the avgpool module & the classifier module as usual. However, we create a new class "AgeGenderClassifier" from nn.Module (& assign it as the new/overriding classifier) that has an "intermediate" sub-module, followed by 2 diverging sub-modules, "age_classifier" & "gender_classifier", which make separate predictions for age & gender from the output of the "intermediate" module.

- This new classifier "AgeGenderClassifier" (& hence our encompassing model) then returns 2 outputs, "age" & "gender", whose separate losses (MSE loss for age & BCE loss for gender) can be computed using their respective ground truths.

- Calculate the overall loss by summing the losses for age & gender, &
perform backpropagation on this overall loss.

Code Snippets:

Dataset:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import cv2, pandas as pd
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"

class AgeGenderDataset(Dataset):
    AGE_RANGE = 80  # the actual scale/range of age (0 to 80).

    preprocess = transforms.Compose([transforms.ToTensor(), transforms.Resize(224), transforms.CenterCrop(224), transforms.ConvertImageDtype(torch.float32), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

    # path -> "../dataset/". label_filename -> "fairface-label-train.csv" (training) or "fairface-label-val.csv" (validation).
    # "label_filename" itself contains the paths to the input images (to be loaded later), along with the age & gender info.
    def __init__(self, path, label_filename, ROWS) -> None:
        super().__init__()
        self.path = path
        # read the training or validation csv, which also contains the paths to the image files.
        df = pd.read_csv(path + label_filename, nrows=ROWS)  # load only the first ROWS rows.
        # convert the categorical field "gender" to numerical (Male/Female = 0/1, i.e. False/True). For BCELoss, convert the labels' type to float (i.e. 0.0/1.0).
        df.loc[:, "gender"] = (df.loc[:, "gender"] == "Female").astype(np.float32)
        # convert the age column's type to float too, otherwise it causes problems while computing loss.backward(), which expects all types to be float. Also, normalize the age values from 0-80 to 0-1.
        df.loc[:, "age"] = (df.loc[:, "age"] / AgeGenderDataset.AGE_RANGE).astype(np.float32)
        self.input_df = df

    def __len__(self):
        return len(self.input_df)

    def __getitem__(self, index):
        # return the loaded input image & its labels: age, gender.
        row = self.input_df.iloc[index]
        image = cv2.imread(self.path + "fairface-img-margin025-trainval/" + row["file"])
        # reorder the channels to RGB & normalize the pixel values between 0-1 by dividing by 255.
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) / 255
        image = AgeGenderDataset.preprocess(image)
        return image.to(device), row["age"], row["gender"]

    # static method to prepare a single sample input "input_name" for prediction.
    def test_input(input_name):
        image = cv2.imread(input_name)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) / 255
        image = AgeGenderDataset.preprocess(image)
        image = torch.unsqueeze(image, 0)  # add a batch dimension to the single input image.
        return image.to(device)

Model:
import torch
from torchvision import models
import torch.nn as nn
from AgeGenderClassifierSubModule import AgeGenderClassifierSubModule

class AgeGenderModel(nn.Module):
    model_name = "trained_models/AgeGenderModel_VGG16.pth"

    def __init__(self) -> None:
        super().__init__()
        self.model_name = "trained_models/MultiTask_AgeGender_VGG16.pth"
        self.model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # disable learning on all layers.
        for param in self.model.parameters():
            param.requires_grad = False
        #summary(model, torch.zeros(1, 3, 224, 224))
        # override layers. Input to model.avgpool = (512, 7, 7).
        self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, kernel_size=3), nn.MaxPool2d(2), nn.ReLU(), nn.Flatten())  # input feature-map size = 7x7. Conv2d: 7-2=5, MaxPool2d: 5/2=2. Output = 512*2*2 = 2048.
        self.model.classifier = AgeGenderClassifierSubModule()

    def forward(self, input):
        age_predictions, gender_predictions = self.model(input)
        return age_predictions, gender_predictions

Model - AgeGenderClassifierSubModule:
import torch
import torch.nn as nn

class AgeGenderClassifierSubModule(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.intermediate = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2), nn.Linear(128, 64), nn.ReLU())
        # age classifier (1 output neuron with sigmoid). The age prediction is between (0-1), to be scaled by 80.
        self.age_classifier = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
        # gender classifier (1 output neuron with sigmoid). The gender prediction is thresholded to either 0 or 1 from the sigmoid's output.
        self.gender_classifier = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, input):
        inter_output = self.intermediate(input)
        # diverge into separate predictions for age & gender, & return both.
        age_prediction = self.age_classifier(inter_output)
        # age predictions are to be scaled by 80. Scaling it in-place here (age_prediction *= 80) CANNOT BE DONE, AS IT INTERFERES WITH THE GRADIENTS DURING loss.backward(). It is better to scale the predicted age separately, during testing.
        gender_prediction = self.gender_classifier(inter_output)  # the gender output is a probability between 0 & 1. Correct values should be closer to 0 for Male & 1 for Female, matching the gender labels & minimizing the loss after training.
        return age_prediction, gender_prediction

Train/Validate:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim

device = "cuda" if torch.cuda.is_available() else "cpu"

155
def train(epochs, model:nn.Module, dLoader:DataLoader, age_lossFn, gender_lossFn,
optimFn:optim):
model.train()
torch.autograd.set_detect_anomaly(True)
for epoch in range(epochs):
for _, [input_image, age_labels, gender_labels] in enumerate(dLoader):
optimFn.zero_grad()
age_predictions, gender_predictions = model(input_image)
# # BCELoss() expects both predictions & labels to be of type float.
BCELoss seems to calculated as: (number of incorrect predictions / total predictions).
age_predictions = age_predictions.squeeze().to(device) # remove all
dimensions of size 1 from tensor's shape. "age_predictions" original shape = [<batch
size>,1], while "age_labels" shape=[<batch size>].
age_loss = age_lossFn(age_predictions, age_labels)
# If gender classification module has 2 outputs(one-hot encoded style):
convert "gender_predictions" from one-hot to 1D tensor, like labels. Since we are
using only 1 output neuron for gender classification in our model, no need to use one-
hot-encoding conversion to 1D (using argmax() below), as output for gender
classification is already in 1D form.
#gender_predictions = torch.argmax(gender_predictions, dim=1)
gender_predictions = gender_predictions.squeeze().to(device) # remove
all dimensions of size 1 from tensor's shape. "gender_predictions" original shape =
[<batch size>,1].
gender_loss = gender_lossFn(gender_predictions, gender_labels)
# sum both losses to get total loss, that we will use to backpropagate.
total_loss = age_loss + gender_loss
total_loss.backward()
optimFn.step()
print(f"Epoch:{epoch} - Training: Age Loss = {age_loss.item()} , Gender Loss =
{gender_loss.item()} , Total Loss = {total_loss.item()}")

def validate(model:nn.Module, dLoader:DataLoader, age_lossFn):
    model.eval()
    total_batches = 0
    total_age_loss = 0
    total_gender_accuracy = 0
    with torch.no_grad():
        for _, [input_image, age_labels, gender_labels] in enumerate(dLoader):
            age_predictions, gender_predictions = model(input_image)
            # squeeze BEFORE computing the loss, so that the shapes match ([<batch size>, 1] vs [<batch size>]).
            age_predictions = age_predictions.squeeze().to(device)
            age_loss = age_lossFn(age_predictions, age_labels)
            # gender accuracy can be computed, as it is categorical in nature.
            gender_predictions = gender_predictions.squeeze().to(device)
            gender_accuracy = compute_gender_accuracy(gender_predictions, gender_labels)
            print(f"Validation per Batch: Age Loss = {age_loss.item()} , Gender Accuracy Ratio = {gender_accuracy}")
            total_age_loss += age_loss.item()
            total_gender_accuracy += gender_accuracy
            total_batches += 1
    print(f"Average: Age Loss = {total_age_loss/total_batches} , Gender Accuracy Ratio = {total_gender_accuracy/total_batches}")

# computes the accuracy for a categorical variable/field.
def compute_gender_accuracy(predictions, labels):
    # no need for a one-hot to 1D conversion (argmax() below), as the gender output is already 1D,
    # because the model uses only 1 output neuron for gender classification.
    #predictions = torch.argmax(predictions, dim=1) # convert one-hot to 1D tensor, like labels.
    # convert the probabilities in predictions to 0/1, as in labels.
    predictions = (predictions > 0.5).to(torch.int32)
    accuracy = (predictions == labels).sum()
    return accuracy.item()/len(labels)

Main:
import torch
import torch.nn as nn
from AgeGenderDataset import AgeGenderDataset
from torch.utils.data import DataLoader
from AgeGenderModel import AgeGenderModel
from train_validate import train, validate
import torch.optim as optim

device = "cuda" if torch.cuda.is_available() else "cpu"

ROWS = 10000

def run_training():
dataset_path = "../dataset/"
train_labels = "fairface-label-train.csv"
epochs = 20
AG_dataset = AgeGenderDataset(dataset_path, train_labels, ROWS)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(AG_dataset, batch_size=100, shuffle=True, drop_last=True)
model = AgeGenderModel().to(device)
age_lossFn = nn.MSELoss()
gender_lossFn = nn.BCELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Started-----------------")
train(epochs, model, dLoader, age_lossFn, gender_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")

def run_validation():
dataset_path = "../dataset/"
train_labels = "fairface-label-val.csv"
AG_dataset = AgeGenderDataset(dataset_path, train_labels, ROWS//10) # use an integer row count (ROWS/10 would be a float).
dLoader = DataLoader(AG_dataset, batch_size=100, shuffle=False, drop_last=True)
model = AgeGenderModel().to(device)
result = model.load_state_dict(torch.load(model.model_name))
age_lossFn = nn.MSELoss()
print("------------Validation Started-----------------")
validate(model, dLoader, age_lossFn)

print("------------Validation Complete-----------------")

# to resume training.
def resume_training(epochs = 20):
dataset_path = "../dataset/"
train_labels = "fairface-label-train.csv"
AG_dataset = AgeGenderDataset(dataset_path, train_labels, ROWS)
dLoader = DataLoader(AG_dataset, batch_size=100, shuffle=True, drop_last=True)
model = AgeGenderModel().to(device)
# load previously trained model
result = model.load_state_dict(torch.load(model.model_name))
age_lossFn = nn.MSELoss()
gender_lossFn = nn.BCELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Resumed-----------------")
train(epochs, model, dLoader, age_lossFn, gender_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs,
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")

# test_on_input() performs prediction on a single input image.
def test_on_input(input_filename):
# load & preprocess input.
input_image = AgeGenderDataset.test_input(input_filename)
model = AgeGenderModel().to(device)
# load the previously trained model & switch to eval mode (disables dropout for inference).
result = model.load_state_dict(torch.load(model.model_name))
model.eval()
# perform prediction.
age_predictions, gender_predictions = model(input_image)
# output predictions. Can cross-verify with the actual age/gender from the csv, if label data is available.
age_predictions = age_predictions.squeeze().to(device)
age_predictions *= AgeGenderDataset.AGE_RANGE # scale to the actual age range.
age_predictions = age_predictions.to(torch.int32)
gender_predictions = gender_predictions.squeeze().to(device)
# map the gender probability to an actual value i.e. Male/Female.
gender_value_prediction = None
if(gender_predictions.item() > 0.5):
    gender_value_prediction = "Female"
else:
    gender_value_prediction = "Male"
print("\n")
print(f"Predictions: Age = {age_predictions.item()} , Gender = {gender_value_prediction}")
print("\n")

#run_training()

#resume_training()

run_validation()

#test_on_input("../dataset/fairface-img-margin025-trainval/val/6000.jpg") # Actual: age = 29, gender = Male. Predictions: age = 43, gender = Male.
#test_on_input("../dataset/fairface-img-margin025-trainval/val/6002.jpg") # Actual: age = 24, gender = Female. Predictions: age = 28, gender = Female.
#test_on_input("../dataset/fairface-img-margin025-trainval/val/6008.jpg") # Actual: age = 57, gender = Female. Predictions: age = 36, gender = Female.
test_on_input("../dataset/fairface-img-margin025-trainval/val/6010.jpg") # Actual: age = 10, gender = Male. Predictions: age = 33, gender = Male.

Class Activation Maps(CAM):

Class activation maps are a simple technique to get the discriminative image
regions used by a CNN to identify a specific class in the image. In other words, a
class activation map (CAM) lets us see which regions in the image were relevant
to this class.

Class Activation Maps at different stages/layers.
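
A minimal CAM sketch (assuming a torchvision resnet18 backbone, whose last conv block feeds a global-average-pool and a single fc classifier; the layer/attribute names below are specific to torchvision's resnet18):

import torch, torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
features = {}
def hook(module, input, output):
    features["maps"] = output # shape: [1, 512, 7, 7].
model.layer4.register_forward_hook(hook) # capture the last conv block's output.

img = torch.rand(1, 3, 224, 224) # placeholder; use a real preprocessed image.
logits = model(img)
cls = logits.argmax(dim=1).item()
# CAM = class-specific weighted sum of the feature maps, using that class's fc weights.
fc_weights = model.fc.weight[cls] # shape: [512].
cam = (fc_weights[:, None, None] * features["maps"][0]).sum(dim=0)
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8) # normalize to 0-1.
cam = F.interpolate(cam[None, None], size=(224, 224), mode="bilinear")[0, 0]
# 'cam' can now be overlaid on the input image as a heatmap.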

Practical Aspects to take care during model training:

(1) Imbalanced Data:


Consider predicting an object that occurs very rarely within the dataset, say
in 1% of the data. If the model misses this object whenever it actually occurs in
the test dataset, then even if it correctly detects that the remaining data does
not contain the object (i.e. 99% right, but 1% wrong), the model is still useless.

We can also use class weights for imbalance in data (in addition to the loss
function) i.e. assigning higher weights to rarely occurring classes, thereby
explicitly telling the model that we want to correctly classify the rare class.
These weights can be provided to the loss function via its "weight" argument:
Ex (CrossEntropy loss):
# class weights for 6-class multi-class classification (must be a tensor).
class_weights = torch.tensor([0.5281, 0.8411, 0.9619, 0.8634, 0.8477, 0.9577])

# loss function with class weights
criterion = nn.CrossEntropyLoss(weight = class_weights)

Formula for calculating weights:
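A common choice (assumed here) is the inverse class-frequency weighting w_c = N / (K * n_c), where N = total samples, K = number of classes & n_c = samples of class c:

import torch

def compute_class_weights(class_counts):
    counts = torch.tensor(class_counts, dtype=torch.float32)
    return counts.sum() / (len(counts) * counts) # rarer classes get larger weights.

print(compute_class_weights([900, 50, 50])) # tensor([0.3704, 6.6667, 6.6667])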

Ex (MSE loss):
def mse_loss(input, target):
    return ((input - target) ** 2).mean() # vanilla MSE loss (the mean, not the sum, of squared errors).

def weighted_mse_loss(input, target, weight):
    return (weight * (input - target) ** 2).mean() # weighted MSE loss.

The above code demonstrates how weights can be used with regression losses
such as MSE, MAE, etc.

ViT (Vision Transformer):

ViT uses transformers instead of CNNs for performing vision tasks such as
classification & detection.

Tokenization: The input image is divided into fixed-size non-overlapping
patches or tokens.
Each patch is treated as a flattened (1D) sequence of features (e.g. like an
embedding vector of a single token in NLP), and these patches are linearly
embedded into a lower-dimensional space to serve as the model's input.
Transformer requires the input token’s embedding vector to be of a fixed
size ‘D’, so a patch is mapped to a vector of dimension ‘D’ via a learnable linear
projection.
Each patch is analogous to a token (embedding vector), and all patches
together form the input sequence to the ViT. If (P, P) is the size of a patch
(the downsampling ratio - lower P means higher feature resolution & vice versa),
then N = (H/P) * (W/P) is the number of patches, and the size of a single
(flattened) patch is P² * C, for an input image of height 'H', width 'W' & number
of channels 'C'. The size of the entire input sequence is thus (N * P² * C).
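
A minimal patch-embedding sketch (assumed sizes: H = W = 224, P = 16, C = 3, D = 768); the learnable linear projection of flattened patches is commonly implemented as a Conv2d with kernel_size = stride = P:

import torch
import torch.nn as nn

H = W = 224; P = 16; C = 3; D = 768
N = (H // P) * (W // P) # number of patches = 14 * 14 = 196.
patchify = nn.Conv2d(C, D, kernel_size=P, stride=P) # learnable projection of flattened patches.
img = torch.rand(1, C, H, W)
tokens = patchify(img) # [1, D, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2) # [1, N, D] = [1, 196, 768]
print(tokens.shape)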

Positional Encoding: Just like in NLP Transformers, Vision Transformers
incorporate positional information into the input data. This is typically done
using 1D learnable positional encodings, which provide information about the
spatial location of each patch in the image.

Position embeddings of models trained with different hyperparameters.

A [CLS] token is added to serve as a representation of the entire image,
which can be used for classification.
The [CLS] token itself is a learnable embedding; after passing through the
encoder, its output representation captures a global view of the image and
plays a crucial role in aggregating information from the entire image for
downstream tasks. The learned weights of the model transform and process
the information from the CLS token and the other patch embeddings to make
predictions.
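
A hedged sketch of prepending a learnable [CLS] token and adding learnable 1D positional embeddings (illustrative variable names, not a library API):

import torch
import torch.nn as nn

N, D = 196, 768
cls_token = nn.Parameter(torch.zeros(1, 1, D)) # learnable [CLS] embedding.
pos_embedding = nn.Parameter(torch.zeros(1, N + 1, D)) # one position per token, +1 for [CLS].
tokens = torch.rand(1, N, D) # patch embeddings from the previous step.
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1) # [1, N+1, D]
tokens = tokens + pos_embedding # inject positional information.
# after the encoder, tokens[:, 0] (the [CLS] output) is fed to the classification head.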

Self Attention: ViT uses self-attention mechanisms to compute
relationships between different patches in the image. This allows the model to
attend to relevant patches and learn contextual information for each patch.

Classification Head: Vision Transformers are often used for image
classification tasks. They typically have a classification head attached to the
output of the transformer encoder, which maps the final hidden representations
to class scores.

ViT does not introduce image-specific inductive biases into the architecture apart
from the initial patch extraction step.

A Vision Transformer (ViT) attains excellent results when pre-trained at
sufficient scale (huge datasets) and transferred to tasks with fewer data points.
ViT-L/16 means the "Large" variant with a 16 x 16 input patch size.
All training is done at resolution 224. During fine-tuning, it is often beneficial to
use a higher resolution than during pre-training.

Hybrids (ViT with CNN backbone) slightly outperform ViT at small computational
budgets, but the difference vanishes for larger models. This result is somewhat
surprising, since one might expect convolutional local feature processing to
assist ViT at any size.
Self-attention allows ViT to integrate information across the entire image
even in the lowest layers. The “attention distance” is analogous to receptive
field size in CNNs.
Some heads attend to most of the image already in the lowest layers (i.e.
blocks - network depth), showing that the ability to integrate information globally
is indeed used by the model. Other attention heads have consistently small
attention distances in the low layers. This highly localized attention is less
pronounced in hybrid models that apply a ResNet before the Transformer,
suggesting that it may serve a similar function as early convolutional layers in
CNNs.

That the position embeddings learn to represent the 2D image topology explains
why hand-crafted 2D-aware embedding variants do not yield additional
improvements.

(B) OBJECT DETECTION:

Classification vs Localization vs Detection (Detection = Classification &
Localization together, for multiple objects/classes in an image).

Multiclass: An image belonging to one class out of several possible classes.

Multiclass Detection.

Multilabel: An image with multiple labels assigned to it simultaneously.

Multilabel Detection.

YBat Tool:
For preparing dataset that contains ground truth (bounding box coordinates &
classes of objects) for given input images, we can use data annotation tools,
such as Ybat (Yolo BBox Annotation Tool). It is available at:
https://github.com/drainingsun/ybat

Steps to use YBat:

(1) Create a txt file for the classes, containing all the class names (1 class per
line); say "classes.txt". Upload it on the (opened) ybat.html page, under the
"Classes: Choose File" button.
(2) Prepare ground truth (drawing BoundingBox or BB around target objects
in images/dataset). Upload the images in the “Images: Choose Files” option, by
selecting all the images that are to be included for annotation.
(3) Before performing annotation(creating BB), be sure to select the
correct class from the class list. Perform annotation for all desired objects in all
images.
NOTE: BBs, once created, can be resized, moved around or deleted
using delete key (Right Cmd + Delete on mac). Restore will bring back any
deleted BBs.
(4) Save the annotated data using the "Save Yolo" button. This downloads
zipped .txt file(s) containing the BBs of all objects, one file per image.
For the Yolo format, the classes are numbered 0 onwards, & all the BB
coordinates (x-center, y-center, width, height) are normalized from 0 (origin -
top left) to 1 (image dimensions - bottom-right).
X-center & y-center are the center-point of the BB. To normalize coordinates, we
divide x values by width of image, & y values by height of image.
YBat also allows saving in COCO (outputs in JSON - x-left, y-top, width,
height) or VOC (outputs in xml - xmin-left, ymin-top,xmax-right, ymax-bottom)
format. Above formats also provide the original image dimensions.
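
A small sketch converting one YOLO-format annotation back to corner (VOC-style) pixel coordinates, assuming the original image dimensions are known:

def yolo_to_corners(x_center, y_center, w, h, img_w, img_h):
    # undo the normalization (x values by image width, y values by image height).
    xc, yc = x_center * img_w, y_center * img_h
    bw, bh = w * img_w, h * img_h
    return (xc - bw/2, yc - bh/2, xc + bw/2, yc + bh/2) # (x_min, y_min, x_max, y_max)

# a box centred in a 640x480 image, half as wide/tall as the image:
print(yolo_to_corners(0.5, 0.5, 0.5, 0.5, 640, 480)) # (160.0, 120.0, 480.0, 360.0)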

Some important concepts in Object Detection are:
- Region Proposals
- IoU
- Non-maximum suppression
- mAP (mean average precision)

Region Proposals:

Region proposal is a technique that helps in identifying (rectangular) islands of
regions where the pixels are similar to one another.
It aids in object localization, where the task is to identify a bounding box that fits
exactly around the object in the image.
The general idea is that a region proposal algorithm should inspect the image
and attempt to find regions that likely contain an object, instead of using the
more time-consuming sliding-window method.

A Region Proposal Network, or RPN (the backbone of Faster R-CNN), is a fully
convolutional network that simultaneously predicts probable object bounds
(rectangles) and objectness scores at each position. The RPN is trained
end-to-end to generate high-quality region proposals.

A region proposal that has a high intersection (computed using IoU - Intersection
over Union) with the location (ground truth) of an object in the image of interest is
labeled as the one that contains the object, and a region proposal with a low
intersection is labeled as background.

Intersection within the term Intersection over Union measures how overlapping
the predicted and actual bounding boxes are, while Union measures the overall
space possible for overlap. IoU is the ratio of the overlapping region between the
two (predicted & ground truth) bounding boxes over the combined region of both
the bounding boxes.

SelectiveSearch is an algorithm for generating region proposals. It is
designed to be fast, with a very high recall.
It is based on computing a hierarchical grouping of similar regions, based on
colour, texture, size and shape compatibility.
It can be installed as: pip install selectivesearch & used as
import selectivesearch
Ex:
# scale controls the number/size of allowed regions/segmentations.
img_label, regions = selectivesearch.selective_search(input_img, scale=200, min_size=100)

Non Max Suppression:

When multiple region proposals are generated and (their BBs) significantly
overlap one another, Non-Max Suppression can be used to select the best BB
containing the object of interest, out of all the BBs.

“Non-max” refers to the boxes that do not contain the highest probability of
containing an object, and “suppression” refers to us discarding those boxes that
do not contain the highest probabilities of containing an object. In non-max
suppression, we identify the bounding box that has the highest probability and
discard all the other bounding boxes that have an IoU greater than a certain
threshold with the box containing the highest probability of containing an object.

In PyTorch, non-max suppression is performed using the nms function in the
torchvision.ops module. The nms() function takes the bounding box coordinates,
the confidence scores of the objects in the bounding boxes, and the IoU
threshold across bounding boxes, and identifies the bounding boxes to be retained.
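
Ex (a minimal nms() sketch; the first two boxes overlap heavily, so the lower-scoring one is suppressed):

import torch
from torchvision.ops import nms

# boxes are in (x_min, y_min, x_max, y_max) format.
boxes = torch.tensor([[10., 10., 100., 100.], [12., 12., 102., 102.], [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep) # tensor([0, 2]) - indices of the retained boxes, highest score first.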

mAP:

mAP (Mean Average Precision) is a popular metric used to compute the
accuracy of the predicted class & its BB by the model.

mAP is measured by taking the mean of all average precisions (the area under a
Precision vs Recall curve) across all IoU thresholds and for all classes. This
metric provides an overall model performance, irrespective of any manually-set
threshold.

Precision is defined as:

P = TP / (TP + FP)
i.e. P = TP / Total (positive) Predictions (of that class by the model).
Precision measures how accurate your predictions are i.e. the percentage of
your predictions that are correct.
A True Positive refers to a bounding box that predicts the correct class of
object and has an IoU with the ground truth greater than a certain threshold. A
False Positive refers to a bounding box that predicted the class incorrectly, or
has an overlap with the ground truth less than the defined threshold.
Furthermore, if multiple bounding boxes are identified for the same ground
truth bounding box, only one box counts as a true positive, and every other
box counts as a false positive.
Precision is the ability of a model to identify only the relevant objects.

Recall is defined as:

R = TP / (TP + FN)
i.e. R = TP / Total (positive) Ground-Truths.
Recall measures how well you find all the (actual) positives. i.e. identifies all of
the positive cases.
Recall is the ability of a model to find all the relevant (ground truth) objects.

High scores for both show that the classifier is returning accurate results
(high precision), as well as returning a majority of all positive results (high recall).
A system with high recall but low precision returns many results, but most
of its predicted labels are incorrect when compared to the training labels. A
system with high precision but low recall is just the opposite, returning very few
results, but most of its predicted labels are correct when compared to the training
labels. An ideal system with high precision and high recall will return many
results, with all/most results labeled correctly.
Precision and Recall are the two most common metrics that take into
account class imbalance (when you observe more data points of one class than
of another).
These quantities are also related to the (F1) score (evaluation metric that
measures a model's accuracy), which is defined as the harmonic mean of
precision and recall.

F1 = 2 * (P * R) / (P + R)

F1 score has been designed to work well on imbalanced data.


In the F1 score, we compute the average of precision and recall. They are both
rates, which makes it a logical choice to use the harmonic mean (an alternative
to the more common arithmetic mean) to handle any potential imbalance in the
precision/recall values, because it punishes extreme values.
The harmonic mean is close to the smallest of the input numbers, minimizing
the impact of large outliers and maximizing the impact of small ones.
Ex: A classifier with a precision of 1.0 and a recall of 0.0 has a simple
average of 0.5 but an F1 score of 0.
Since the F1 score is an average of Precision and Recall, it means that the
F1 score gives equal weight to Precision and Recall:
- A model will obtain a high F1 score if both Precision and Recall are high.
- A model will obtain a low F1 score if both Precision and Recall are low.
- A model will obtain a medium F1 score if one of Precision and Recall is low
and the other is high.
F1 score ranges from 0 to 1.
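
A small self-contained sketch computing precision, recall & F1 from TP/FP/FN counts (the counts below are illustrative):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# 8 correct detections, 2 false alarms, 4 missed objects:
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}") # P=0.800 R=0.667 F1=0.727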

Average Precision: Average precision is the average of precision values (for
a given single class) calculated at various IoU thresholds.

mAP: mAP is the average of the precision values calculated at various IoU
threshold values, across all the classes of objects present within the dataset.

For detection, a common way to determine whether one object proposal was
right is Intersection over Union (IoU). Commonly, IoU > 0.5 means it was a hit,
otherwise it was a fail. If one wants better proposals, one increases the IoU
threshold from 0.5 to a higher value (up to 1.0, which would be perfect). One
can denote this with mAP@p, where p ∈ (0,1) is the IoU threshold.

Example of IoU scores in mAP.


mAP@[50:95] (sometimes denoted as mAP@[.5,.95]) means average
mAP over different IoU thresholds, from 0.5 to 0.95, step 0.05 (0.5, 0.55, 0.6,
0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95).

Ex: mAP@0.5 = 0.98 means mAP with IoU=0.5 (50% overlap) has an
accuracy of 0.98 (98%).
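
A hedged sketch of the "all-points" average-precision computation (the area under the precision-recall curve after taking the precision envelope, PASCAL VOC style); the inputs are assumed to be already sorted by increasing recall:

import numpy as np

def average_precision(recalls, precisions):
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1): # precision envelope: make p non-increasing.
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0] # recall steps.
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1])) # sum of rectangle areas.

# illustrative PR points; mAP then averages this AP over classes & IoU thresholds.
print(average_precision(np.array([0.2, 0.4, 0.8]), np.array([1.0, 0.8, 0.5]))) # 0.56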

Object detection can be done using several model architectures, such as R-CNN
(Region-based CNN), Fast R-CNN, Faster R-CNN, YOLO, SSD, combining CV
with NLP using transformers (using positional embedding to identify regions
containing the object) such as DETR (Detection Transformer), etc.

R-CNN:

R-CNN assists in identifying both the objects present in the image and the
location of objects within the image.

Working & Steps:

(1) Extract region proposals from an image: Ensure that we extract a high
number of proposals to not miss out on any potential object within the image.

(2) Resize (warp) all the extracted regions to get images of the same size.

(3) Pass the resized region proposals through a network: Typically, we pass
the resized region proposals through a pretrained model such as VGG16 or
ResNet50 and extract the features in a fully connected layer. We can also use
MobileNet(v2) instead of VGG16 for the feature maps; it is smaller and faster,
while giving similar accuracy.

(4) Create data for model training, where the input is features extracted by
passing the region proposals through a pretrained model, and the outputs are the
class corresponding to each region proposal and the offset of the region proposal
(RP) from the ground truth corresponding to the image:
If a region proposal has an IoU greater than a certain threshold with the
object, we prepare training data in such a way that the concerned region is
responsible for predicting the class of object it is overlapping with and also the
offset of the region proposal with the ground truth bounding box that contains the
object of interest (computed as {BB_coordinates - RP_coordinates}).

We calculate the offset between the region proposal bounding box and the
ground truth bounding box as the difference between center coordinates of the
two bounding boxes (dx, dy) and the difference between the height and width of
the bounding boxes (dw, dh).
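
A small sketch of this centre/size offset computation (boxes assumed to be in (x_min, y_min, x_max, y_max) format; note that the example code later in this document instead stores plain corner-coordinate differences):

def bb_offsets(rp, gt):
    rp_cx, rp_cy = (rp[0] + rp[2]) / 2, (rp[1] + rp[3]) / 2
    gt_cx, gt_cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    dx, dy = gt_cx - rp_cx, gt_cy - rp_cy # centre differences.
    dw = (gt[2] - gt[0]) - (rp[2] - rp[0]) # width difference.
    dh = (gt[3] - gt[1]) - (rp[3] - rp[1]) # height difference.
    return dx, dy, dw, dh

print(bb_offsets((10, 10, 50, 50), (12, 14, 60, 58))) # (6.0, 6.0, 8, 4)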

(5) Connect two output heads, one corresponding to the class of image and
the other corresponding to the offset of the region proposal with the ground truth
bounding box to extract the fine bounding box on the object (similar to Multi-Task
training - categorical class variable & continuous offset variable).

(6) Train the model, writing a custom loss function that minimizes both object
classification error and the bounding box offset error.

1) Example code for R-CNN object detection on Bus-Truck:

For the scenario of object detection, we will download the data from the Google
Open Images v6 dataset (available at
https://www.kaggle.com/datasets/sixhky/open-images-bus-trucks). However, in
code, we will work on only those images that are of a bus or a truck.

Object detection includes defining the functions/operations for:
(a) region proposal extraction
(b) IoU calculation

Illustration of populating IoU for candidates/RP for BBs(in case a single image
contains multiple BB for multiple objects/labels):

IoUs for (multiple) BBs in a given single image.

ious = np.array([[extract_iou(candidate, _bb_) for candidate in candidates] for _bb_ in bbs])

The above code snippet gives output in the form (for the above image):
ious = [[0.1, 0.9, 0, 0], [0, 0, 0.2, 0.8]], where the 1st list in "ious" corresponds
to bb1, & the 2nd list corresponds to bb2 - i.e. the format is ious = [[c1, c2, c3,
c4] - for bb1, [c1, c2, c3, c4] - for bb2].
ious = ious.T, i.e. the transpose, gives output in the form:
ious = [[0.1, 0], [0.9, 0], [0, 0.2], [0, 0.8]], where the 1st element in each inner
list corresponds to bb1, & the 2nd element corresponds to bb2 - i.e. the format
is ious = [[bb1, bb2] - for c1, [bb1, bb2] - for c2, [bb1, bb2] - for c3, [bb1, bb2] - for c4].

We then find the best IoU for that candidate/RP, which in turn gives the best BB
corresponding to that IoU - i.e. which ground truth BB this candidate best
corresponds to.
Then, if this best IoU is above a threshold, we assign the label for this RP as the
label for the corresponding ground truth BB; else we assign the label as
background.

Dataset:

import torch
from torch.utils.data import Dataset
from torchvision import transforms
import cv2, pandas as pd
from CommonObjDetectionFunctions import (generate_region_proposals, compute_iou,
    assign_classes_via_IoU, compute_BB_offsets)
from torchvision.ops import nms # for non-maximum suppression.

device = "cuda" if torch.cuda.is_available() else "cpu"

class BusTruckDataset(Dataset):
resize_value = 224
labels = ["Background", "Bus", "Truck"]
preprocess = transforms.Compose([transforms.ToTensor(),
transforms.Resize(resize_value), transforms.CenterCrop(resize_value),
transforms.ConvertImageDtype(torch.float32), transforms.Normalize(mean=[0.485, 0.456,
0.406], std=[0.229, 0.224, 0.225])])

def __init__(self, images_path, dataset_path, Rows, train) -> None:
    super().__init__()
    self.image_path = images_path # path that points to the actual images.
    self.data = []
    # prepare the data to be loaded into IMG_PATHS, RP_LOCATIONS, LABELS, DELTAS.
    self.prepare_data(dataset_path, Rows, train)

def __len__(self):
    return len(self.data)

def __getitem__(self, index):
    image_name, rp, label, gt_bb_offset = self.data[index]
    # read the entire image & extract the (unnormalized) RP from it into 'rp_img'.
    img = cv2.imread(self.image_path + image_name + ".jpg")
    x_min, y_min, x_max, y_max = rp[0], rp[1], rp[2], rp[3]
    # cv2 internally uses numpy arrays. numpy expects the format [y:y+h, x:x+w].
    sub_img = img[y_min:y_max, x_min:x_max]
    rp_img = sub_img.copy() # an independent copy of the region proposal sub-image, to return.
    rp_img = rp_img/255 # convert the range to between 0-1.
    rp_img = BusTruckDataset.preprocess(rp_img) # perform preprocessing.
    # return the RP, its assigned label & its ground truth BB offset. Move all of these to 'device'.
    return (rp_img.to(device), torch.tensor(label, dtype=torch.int64).to(device),
        torch.tensor(gt_bb_offset, dtype=torch.float32).to(device))

# flattens the data into a format suitable for usage in __getitem__(). Note that if the label is 0/background, the delta is stored as [0,0,0,0].
def flatten_data(self, IMG_PATHS, RP_LOCATIONS, LABELS, DELTAS):
    self.data = []
    for sub_list_index, candidates in enumerate(RP_LOCATIONS):
        image_name = IMG_PATHS[sub_list_index] # get the image name.
        for element_index, rp in enumerate(candidates):
            # get the label for this RP.
            labels_sublist = LABELS[sub_list_index]
            label = labels_sublist[element_index]
            # get the delta for this RP.
            deltas_sublist = DELTAS[sub_list_index]
            delta = deltas_sublist[element_index]
            # accumulate image name, RP, label, delta.
            self.data.append((image_name, rp, label, delta))
    #print(len(self.data)) # should equal self.DATA_LENGTH (used for testing only).

# prepares data to be consumed by this Dataset. prepare_data() is somewhat specific to the data format provided in the csv dataset file.
def prepare_data(self, input_csv_file, Rows, train):
    IMG_PATHS = [] # holds image names from the dataset.
    RP_LOCATIONS = [] # for EACH entry in 'IMG_PATHS', holds its (list of) RP locations; a list of lists. The values are not normalized, so the raw values can be used to fetch the sub-section (RP) after loading the image.
    LABELS = [] # for each entry in 'RP_LOCATIONS', holds its class values (list of lists).
    DELTAS = [] # for each entry in 'RP_LOCATIONS', holds the offsets from the ground truth BB for that particular RP.
    if train:
        # for training, use the first 'Rows'.
        df = pd.read_csv(input_csv_file, nrows=Rows)
    else:
        # for validation, use the last 'Rows'.
        df = pd.read_csv(input_csv_file)
        df = df.tail(Rows)
    # start extracting data & presenting it in the desired form.
    for _, row in df.iterrows():
        # generate RPs for this image, and append all these RPs along with their classes (or background).
        img = cv2.imread(self.image_path + row["ImageID"] + ".jpg")
        candidates, normalized_candidates = generate_region_proposals(img)
        bb = row["XMin"], row["YMin"], row["XMax"], row["YMax"] # ground truth (normalized) BB.
        # compute IoUs for each candidate of the current image 'row["ImageID"]'.
        curr_image_ious = [compute_iou(c, bb) for c in normalized_candidates]
        curr_RP_classes = [assign_classes_via_IoU(iou, row["LabelName"], BusTruckDataset.labels) for iou in curr_image_ious]
        IMG_PATHS.append(row["ImageID"])
        RP_LOCATIONS.append(candidates) # un-normalized RP locations, for directly extracting the sub-image (RP) from the loaded input image.
        LABELS.append(curr_RP_classes)
        # compute offsets from the ground truth (bb) for each (normalized) candidate, only if the assigned class is not background.
        curr_deltas = compute_BB_offsets(normalized_candidates, bb, curr_RP_classes)
        DELTAS.append(curr_deltas)
    # flatten the data for further usage.
    self.flatten_data(IMG_PATHS, RP_LOCATIONS, LABELS, DELTAS)
    # Testing - calculate the length of the data - use LABELS, as its element type is the simplest (int).
    #self.DATA_LENGTH = sum([len(sublist) for sublist in LABELS])
    #print(f"Preparing Data Complete.")

# ---------------------- Testing/Predicting on input image --------------------

def load_for_testing(input_filename, row_number):
    # (1) load & preprocess the input.
    df = pd.read_csv(input_filename)
    row = df.iloc[row_number - 1]
    input_image = cv2.imread("../dataset/images/" + str(row["ImageID"]) + ".jpg")
    candidates, normalized_candidates = generate_region_proposals(input_image)
    # gather all RPs into a batch.
    images_batch = None
    for candidate in candidates:
        x_min, y_min, x_max, y_max = candidate[0], candidate[1], candidate[2], candidate[3]
        # cv2 internally uses numpy arrays. numpy expects the format [y:y+h, x:x+w].
        sub_img = input_image[y_min:y_max, x_min:x_max]
        rp_img = sub_img.copy() # an independent copy of the region proposal sub-image.
        rp_img = rp_img/255 # convert the range to between 0-1.
        rp_img = BusTruckDataset.preprocess(rp_img)
        if images_batch is None:
            images_batch = torch.unsqueeze(rp_img, 0) # use the 1st image as the initial tensor; unsqueeze() creates a batch dimension.
        else:
            images_batch = torch.cat((images_batch, torch.unsqueeze(rp_img, 0)), dim=0) # concatenate along the batch (0th) dimension.
    gt_bb = torch.tensor([row["XMin"], row["YMin"], row["XMax"], row["YMax"]]) # ground truth BB.
    #print(images_batch.shape)
    return input_image, images_batch, normalized_candidates, gt_bb, row["LabelName"]

# gets the best detections, in order of highest probabilities first, in the returned lists.
def get_best_detections(input_image, normalized_candidates, class_predictions_for_RPs, bb_prediction):
    # (1) get the probabilities & scores to be used later, in nms.
    class_probabilities = torch.nn.functional.softmax(class_predictions_for_RPs, -1)
    # torch.max() returns the max value of each row in the given dimension, & its index (converts probabilities to classes).
    class_prob_scores, class_predictions_for_RPs = torch.max(class_probabilities, -1)

    # (2) extract the predictions, their RPs & BB predictions that do not belong to the background class.
    class_predictions_indices = torch.where(class_predictions_for_RPs != 0)
    class_predictions_for_RPs = class_predictions_for_RPs[class_predictions_indices]
    normalized_candidates = torch.tensor(normalized_candidates[class_predictions_indices])
    class_prob_scores = class_prob_scores[class_predictions_indices].detach()
    bb_prediction = bb_prediction[class_predictions_indices].detach()

    # (3) perform non-maximum suppression. NMS iteratively removes lower-scoring boxes that have an IoU greater than iou_threshold (0.05 below) with another (higher-scoring) box. nms() returns the indices of the elements/BBs retained by NMS, sorted in decreasing order of scores (highest score first).
    nms_indices = nms(normalized_candidates, class_prob_scores, 0.05)

    # (4) return the top predicted BBs after performing nms().
    best_bbs = []
    best_class_predictions = []
    best_class_prob_scores = []
    for best_index in nms_indices:
        best_class_predictions.append(class_predictions_for_RPs[best_index])
        curr_best_bb_offset = bb_prediction[best_index]
        normalized_RP_for_offset = normalized_candidates[best_index]
        # compute the BB from the RP location & the BB's predicted offset (curr_best_bb_offset).
        curr_best_bb = normalized_RP_for_offset + curr_best_bb_offset
        # add the computed BB to the list.
        best_bbs.append(curr_best_bb)
        best_class_prob_scores.append(class_prob_scores[best_index])
    return best_class_predictions, best_bbs, best_class_prob_scores

CommonObjDetectionFunctions:

# this file contains functionality common to Object Detection, which can be reused.

import numpy as np
import selectivesearch

# generates & returns the region proposals (in both raw and normalized form) for a given image.
def generate_region_proposals(img, pScale=200, pMin_size=100):
    # selective_search() requires a numpy array as input.
    img_label, regions = selectivesearch.selective_search(np.array(img), scale=pScale, min_size=pMin_size)
    img_area = img.shape[0] * img.shape[1] # area = height*width.
    RP_candidates = [] # holds the region proposal candidates.
    for r in regions:
        if r["rect"] in RP_candidates: # ignore, if the RP (tuple form - x,y,w,h) is a duplicate.
            continue
        if r["size"] < (0.05 * img_area): # ignore, if the size is too small.
            continue
        if r["size"] > (1 * img_area): # ignore, if the size is too large.
            continue
        # add this rect (RP) to the candidates list.
        RP_candidates.append(r["rect"])
    # convert (x, y, w, h) to (x1, y1, x2, y2) i.e. corner-points format.
    RP_candidates = [(x, y, x + w, y + h) for (x, y, w, h) in RP_candidates]
    # normalize the candidate (bb1) values within the image boundaries i.e. from pixel values to the 0-1 range. Note that the ground truth BB (bb2) is already normalized. Having both the BB & the candidates in the same normalized range is required when calculating IoU.
    width = img.shape[1]
    height = img.shape[0]
    RP_candidates_normalized = RP_candidates / np.array([width, height, width, height])
    return RP_candidates, np.float32(RP_candidates_normalized)

# computes & returns the IoU between 2 bounding boxes bb1 & bb2. Returns 0 if they do not overlap.
def compute_iou(bb1, bb2, epsilon=1e-5):
# get sub rect that's overlapping (if overlap exists).
x1 = max(bb1[0], bb2[0])
y1 = max(bb1[1], bb2[1])
x2 = min(bb1[2], bb2[2])
y2 = min(bb1[3], bb2[3])
width = x2 - x1
height = y2 - y1
# If no overlap, return 0.
if(width < 0) or (height < 0):
return 0
area_overlap = width * height # get area of overlap (Intersection).
# compute individual areas.
area_1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
area_2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
# get total area (union).
area_combined = area_1 + area_2 - area_overlap
iou = area_overlap / (area_combined + epsilon) # get IoU.
return np.float32(iou)

# determines the class to be assigned to an RP, by checking its IoU against a threshold. The returned value is the class index.
def assign_classes_via_IoU(iou, label_class, labels):
THRESHOLD = 0.3
if iou > THRESHOLD:
return labels.index(label_class) # assign class from database.
else:
return 0 # return Background class index.

# calculates and returns the difference between the candidate BB & the ground truth BB, for all candidates. If the assigned class is of type background, store the delta as [0,0,0,0].
def compute_BB_offsets(normalized_candidates, bb, curr_RP_classes):
    deltas = [] # contains the results.
    for index, candidate in enumerate(normalized_candidates):
        if curr_RP_classes[index] == 0:
            deltas.append([0,0,0,0]) # assign 0 if the assigned class type is background.
        else:
            deltas.append(bb - candidate) # compute the difference between BB & RP (in that order) & store the offset.
    return deltas

ObjDetectModel:

import torch
from torchvision import models
import torch.nn as nn
from ObjDetectSubModule import ObjDetectSubModule

class ObjDetectModel(nn.Module):
    #model_name = "trained_models/ObjDetectModel_VGG16.pth"
    def __init__(self) -> None:
        super().__init__()
        self.model_name = "trained_models/ObjDetectModel_VGG16.pth"
        self.model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # disable learning on all layers.
        for param in self.model.parameters():
            param.requires_grad = False
        #summary(model, torch.zeros(1, 3, 224, 224))
        # override layers. Input to model.avgpool = (512,7,7).
        #self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, kernel_size=3), nn.MaxPool2d(2), nn.ReLU(), nn.Flatten()) # input size = 7x7. Conv2d: 7-2=5, MaxPool2d: 5/2=2. output = 512*2*2 = 2048.
        #self.model.classifier = ObjDetectSubModule(2048)
        # Not modifying "self.model.avgpool" above, so that the input to the classifier below has more nodes (~25k), which might result in better BB prediction, as per the book example.

        # output of the default (adaptive) avgpool = 512*7*7 = 25088, which is now passed to our custom "ObjDetectSubModule" classifier, as per the book example.
        self.model.classifier = ObjDetectSubModule(25088)

    def forward(self, input):
        class_predictions, bb_predictions = self.model(input)
        return class_predictions, bb_predictions

ObjDetectSubModule:

import torch
import torch.nn as nn
from BusTruckDataset import BusTruckDataset

class ObjDetectSubModule(nn.Module):
    def __init__(self, features_dim) -> None:
        super().__init__()
        # label classifier: output nodes = number of labels.
        #self.label_classifier = nn.Sequential(nn.Linear(features_dim, len(BusTruckDataset.labels)), nn.Sigmoid())
        self.label_classifier = nn.Sequential(nn.Linear(features_dim, 6272), nn.ReLU(),
            nn.Linear(6272, 1568), nn.ReLU(), nn.BatchNorm1d(1568), nn.Linear(1568, 392),
            nn.ReLU(), nn.Linear(392, len(BusTruckDataset.labels)), nn.Sigmoid()) # 6272/4=1568, 1568/4=392.

        # object detection: 4 corners of the bounding box, normalized between 0 & 1. Using Tanh() because the output (BB offset) lies between -1 and 1, as offsets can be negative values too.
        #self.detector = nn.Sequential(nn.Linear(features_dim, 512), nn.ReLU(), nn.Linear(512, 4), nn.Tanh())
        self.detector = nn.Sequential(nn.Linear(features_dim, 6272), nn.ReLU(),
            nn.Linear(6272, 1568), nn.BatchNorm1d(1568), nn.ReLU(), nn.Linear(1568, 392),
            nn.ReLU(), nn.Linear(392, 4), nn.Tanh())

    def forward(self, input):
        # diverge into separate predictions for object classification & detection (bounding box), & return both predictions.
        class_prediction = self.label_classifier(input)
        bb_prediction = self.detector(input)
        return class_prediction, bb_prediction

Train_validate:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
import math

device = "cuda" if torch.cuda.is_available() else "cpu"

# This custom loss function penalizes more for predictions where actual classes are not predicted correctly (or predicted as background), & less where actual background classes are predicted as non-background classes.
def custom_classification_loss_fn(class_preds, class_labels, classification_lossFn):
    # 'background_loss' - when background is wrongly predicted as a class.
    class_indices = torch.where(class_labels == 0)
    # get the predictions at indices where the labels are background.
    class_preds_background = class_preds[class_indices]
    class_labels_background = class_labels[class_indices]
    # compute the loss i.e. when background is incorrectly predicted as non-background.
    background_loss = classification_lossFn(class_preds_background, class_labels_background)

    # 'incorrect_class_loss' - when classes are wrongly predicted as background or as other classes. Give higher weightage to this.
    class_indices = torch.where(class_labels != 0)
    class_preds_classes = class_preds[class_indices]
    class_labels_classes = class_labels[class_indices] # get the (non-background) label values.
    weightage = 5
    # compute the loss i.e. when classes are incorrectly predicted either as background, or as other classes.
    incorrect_class_loss = classification_lossFn(class_preds_classes, class_labels_classes)
    return background_loss + (weightage * incorrect_class_loss)

def train(epochs, model:nn.Module, dLoader:DataLoader, classification_lossFn, BB_lossFn, optimFn:optim):
    model.train()
    for epoch in range(epochs):
        # accumulated losses over all batches in an epoch.
        acc_classification_loss = 0
        acc_bb_loss = 0
        acc_total_loss = 0
        num_iterations = 0
        for _, [input_images, class_labels, gt_bb_offsets] in enumerate(dLoader):
            optimFn.zero_grad()
            class_predictions, bb_predictions = model(input_images) # "input_images" are the 'Proposed Regions'.

            # calculate the individual losses.
            # (1) compute the classification loss.
            classification_loss = custom_classification_loss_fn(class_predictions, class_labels, classification_lossFn)

            # (2) COMPUTE THE BB LOSS ONLY WHERE LABEL != 0 (i.e. not background). The classification loss is used for all classes though.
            class_indices = torch.where(class_labels != 0) # 'class_indices' contains those indices where labels != 0.
            # use only the values at 'class_indices' for the BB i.e. regression loss, as these contain the BBs of actual classes (Bus/Truck). Ignore the BB predictions for the background class i.e. do not compute the BB loss for background predictions.
            bb_predictions = bb_predictions[class_indices]
            # do the same for the ground truth offsets.
            gt_bb_offsets = gt_bb_offsets[class_indices]
            bb_loss = BB_lossFn(bb_predictions, gt_bb_offsets)

            # compute the total loss.
            bb_loss_weightage = 10 # give more weightage/penalty to the BB loss.
            total_loss = classification_loss + (bb_loss * bb_loss_weightage) # (classification_loss + 10 * bb_loss) in the book (self.lmb=10, in the notebook link).
            total_loss.backward()
            optimFn.step()
            # accumulate the losses.
            acc_classification_loss += classification_loss.item()
            acc_bb_loss += bb_loss.item()
            acc_total_loss += total_loss.item()
            num_iterations += 1
        print(f"Epoch:{epoch+1} - Training: Classification Loss = {(acc_classification_loss/num_iterations):0.5f} , BB Detection Loss = {(acc_bb_loss/num_iterations):0.5f} , Total Loss = {(acc_total_loss/num_iterations):0.5f}")

def validate(model:nn.Module, dLoader:DataLoader, BB_lossFn):
    model.eval()
    total_classification_accuracy = 0
    total_bb_loss = 0
    total_iterations = 0
    with torch.no_grad():
        for _, [input_images, class_labels, gt_bb_offsets] in enumerate(dLoader):
            class_predictions, bb_predictions = model(input_images)
            # calculate the individual metrics.
            background_accuracy_ratio, correct_class_accuracy_ratio, total_accuracy_ratio = compute_accuracy(class_predictions, class_labels)

            class_indices = torch.where(class_labels != 0)
            bb_predictions = bb_predictions[class_indices]
            gt_bb_offsets = gt_bb_offsets[class_indices]
            bb_loss = BB_lossFn(bb_predictions, gt_bb_offsets)

            # store the values for averaging.
            total_classification_accuracy += total_accuracy_ratio
            # assign bb_loss = 0 if its value is NaN (e.g. a batch with no non-background RPs).
            bb_loss_value = 0
            if math.isnan(bb_loss.item()):
                bb_loss_value = 0
            else:
                bb_loss_value = bb_loss.item()

            total_bb_loss += bb_loss_value
            total_iterations += 1
            print(f"Validation per Batch: Classification Accuracy:(Background Accuracy: {background_accuracy_ratio:0.5f}, Correct Class Accuracy: {correct_class_accuracy_ratio:0.5f}, Total Accuracy: {total_accuracy_ratio:0.5f}) , BB Detection Loss = {bb_loss_value:0.5f}")
    print("\n")
    # use the ACCUMULATED accuracy (not the last batch's ratio) for the average.
    print(f"Average: Classification Accuracy = {total_classification_accuracy/total_iterations} , BB Detection Loss = {total_bb_loss/total_iterations}")
    print("\n")

# computes the accuracy for a categorical variable/field i.e. classification.
def compute_accuracy(predictions, labels):
    total = labels.shape[0] # basically batch_size (rows).
    predictions = torch.argmax(predictions, dim=1)

    # get the accuracy of background classes being correctly predicted.
    class_indices = torch.where(labels == 0)
    class_preds_background = predictions[class_indices]
    class_labels_background = labels[class_indices]
    background_accuracy = (class_preds_background == class_labels_background).sum()
    background_accuracy_ratio = background_accuracy / len(class_labels_background)

    # get the accuracy of (non-background) classes being correctly predicted.
    class_indices = torch.where(labels != 0)
    class_preds_classes = predictions[class_indices]
    class_labels_classes = labels[class_indices]
    correct_class_accuracy = (class_preds_classes == class_labels_classes).sum()
    correct_class_accuracy_ratio = correct_class_accuracy / len(class_labels_classes)

    # get the overall accuracy.
    total_correct_predictions = (predictions == labels).sum()
    total_accuracy_ratio = total_correct_predictions.item() / total
    return background_accuracy_ratio.item(), correct_class_accuracy_ratio.item(), total_accuracy_ratio

Main:

import torch
import torch.nn as nn
from BusTruckDataset import BusTruckDataset
from torch.utils.data import DataLoader
from ObjDetectModel import ObjDetectModel
from train_validate import train, validate
import torch.optim as optim
import numpy as np
import cv2
from torchvision import transforms

# link: https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/blob/master/Chapter07/Training_RCNN.ipynb

device = "cuda" if torch.cuda.is_available() else "cpu"

ROWS = 1000 # number of rows to use from the dataset, since the dataset is huge & generating Region Proposals adds significant time, even for a small amount of data.

def run_training():
epochs = 5 #20
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, shuffle=True, drop_last=True)
model = ObjDetectModel().to(device)
label_lossFn = nn.CrossEntropyLoss()
delta_lossFn = nn.MSELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Started-----------------")
train(epochs, model, dLoader, label_lossFn, delta_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")

def run_validation():
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", int(ROWS/5),
False)
dLoader = DataLoader(dataset, batch_size=100, shuffle=False, drop_last=True)
model = ObjDetectModel().to(device)

result = model.load_state_dict(torch.load(model.model_name))
delta_lossFn = nn.MSELoss()
print("\n")
print("------------Validation Started-----------------")
validate(model, dLoader, delta_lossFn)
print("------------Validation Complete-----------------")
print("\n")

# to resume training.
def resume_training(epochs = 20):
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, shuffle=True, drop_last=True)
model = ObjDetectModel().to(device)
# load previously trained model
result = model.load_state_dict(torch.load(model.model_name))
label_lossFn = nn.CrossEntropyLoss()
delta_lossFn = nn.MSELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Resumed-----------------")
train(epochs, model, dLoader, label_lossFn, delta_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs,
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")

# converts a tensor to the opencv image format (numpy array) by rearranging dimensions before conversion.
def convert_tensor_image_to_opencv(tensor_img):
    # 'tensor_img' format = CxHxW (if using transforms.ToTensor()).
    tensor_img = torch.permute(tensor_img, dims=(1,2,0))
    tensor_img *= 255
    return tensor_img.numpy().astype(np.uint8) # format = HxWxC, as required by cv2 images.

# test_on_input() performs prediction on a single input image.
def test_on_input(input_filename, row_number):
    # (1) get the original input image, the RPs in a batch & the ground truth BB.
    input_image, images_batch, normalized_candidates, gt_bb, gt_label = BusTruckDataset.load_for_testing(input_filename, row_number)

    # (2) load the model.
    model = ObjDetectModel().to(device)
    # load the previously trained model.
    result = model.load_state_dict(torch.load(model.model_name))
    model.eval()
    # perform prediction.
    class_predictions_for_RPs, bb_prediction = model(images_batch)

    best_class_predictions, best_bbs, class_prob_scores = BusTruckDataset.get_best_detections(input_image, normalized_candidates, class_predictions_for_RPs, bb_prediction)

    # scale the BBs wrt the original image dimensions.
    img_width = input_image.shape[1]
    img_height = input_image.shape[0]
    image_dims = np.array([img_width, img_height, img_width, img_height]) # BB scaler wrt the original image.

    ########### TESTING ON A SCALED (224) IMAGE.
    """
    # show the above detections on the resized input image.
    resized_input_image = input_image.copy()
    trns_op = transforms.Compose([transforms.ToTensor(), transforms.Resize(BusTruckDataset.resize_value), transforms.CenterCrop(BusTruckDataset.resize_value)])
    resized_input_image = trns_op(resized_input_image) # format = CxHxW
    resized_input_image = convert_tensor_image_to_opencv(resized_input_image)
    #cv2.imshow("window1", resized_input_image) # testing: show the image in a cv2 window.
    #cv2.waitKey(0)
    input_image = resized_input_image.copy()
    """
    ##########

    color = (0,255,255) # yellow color for the best index.
    best_predicted_label = None
    for index, bb in enumerate(best_bbs):
        bb *= image_dims
        # get the label name for the predicted class number.
        best_predicted_label = BusTruckDataset.labels[best_class_predictions[index]]
        # get the score for this predicted label.
        predicted_label_score = class_prob_scores[index].item()
        # Testing only - print the prediction info.
        print(f"Best Class label: {best_predicted_label} , Score: {predicted_label_score}")
        # draw the BB & its text on the image.
        input_image = draw_bb_on_image(input_image, bb, color, img_width, img_height, best_predicted_label, predicted_label_score)
        color = (0,0,255) # red color for the rest of the predicted BBs.

    # draw the actual ground truth BB (green color) on the image.
    gt_bb *= image_dims
    input_image = draw_bb_on_image(input_image, gt_bb, (0,255,0), img_width, img_height, gt_label, 1)

    cv2.imshow("window1", input_image)
    cv2.waitKey(0)
    print("Prediction completed.")

# draws a rectangle (plus label text) on an image & returns the image.
def draw_bb_on_image(img, bb, color, img_width, img_height, best_predicted_label, predicted_label_score):
    # perform validations (clamp the box to the image boundaries).
    startX = int(bb[0])
    startX = startX if startX >= 0 else 0
    startY = int(bb[1])
    startY = startY if startY >= 0 else 0
    startPoint = (startX, startY)
    endX = int(bb[2])
    endX = endX if endX <= img_width else img_width
    endY = int(bb[3])
    endY = endY if endY <= img_height else img_height
    endPoint = (endX, endY)

    # draw on the image.
    img = cv2.rectangle(img, startPoint, endPoint, color, 1)
    # display both the label & its predicted probability/score above the BB in the image.
    if predicted_label_score == 1: # for the ground truth BB, do not show the score.
        label_description = best_predicted_label
    else:
        label_description = best_predicted_label + " (" + f"{predicted_label_score:0.3}" + ")"
    img = cv2.putText(img, label_description, (startX, startY-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    return img

#run_training()

#resume_training(10)

#run_validation()

#test_on_input("../dataset/df.csv", 9000) # extract the 9000th row from the csv for testing.
#test_on_input("../dataset/df.csv", 9102)
#test_on_input("../dataset/df.csv", 9202)
#test_on_input("../dataset/df.csv", 2000)

# images up to row 1000 are already in training. Detection performs well on training images.
#test_on_input("../dataset/df.csv", 500)
test_on_input("../dataset/df.csv", 600)

RCNN vs Fast-RCNN vs Faster-RCNN:

RCNN vs Fast RCNN:

In Fast-RCNN, instead of feeding the region proposals to the CNN (as in RCNN),
we:
(a) feed the input image to a CNN (a pretrained model) to generate a
convolutional feature map.

(b) From the convolutional feature map, we identify the region proposals
(via selectivesearch) and use an RoI (Region of Interest) pooling layer to
reshape them into a fixed size, so that they can be fed into a fully connected
layer. This replaces the per-proposal warping that was executed in the R-CNN
technique.
(c) From the RoI feature vector, we use a softmax layer to predict the class of
the proposed region and also the offset values for the bounding box.
The reason “Fast R-CNN” is faster than R-CNN is because you don’t have
to feed “N” region proposals to the convolutional neural network every time.
It takes the whole image and region proposals as input in its CNN
architecture in one forward propagation.
The convolution operation is done only once per image and a feature map
is generated from it; the region proposals are then taken from this single
feature map (thereby avoiding the need to pass each resized RP through the
Conv2D layers) before being passed through the rest of the model.
When you look at the performance of Fast R-CNN during testing time,
including region proposals slows down the algorithm significantly when compared
to not using region proposals. Therefore, region proposals become bottlenecks in
the Fast R-CNN algorithm affecting its performance.

R-CNN vs Fast-RCNN.

Faster R-CNN:

Both of the above algorithms (R-CNN & Fast R-CNN) use selective search to find
out the region proposals. Selective search is a slow and time-consuming process
affecting the performance of the network.

Faster RCNN eliminates the selective search algorithm and lets the
network learn the region proposals. Similar to Fast R-CNN, the image is provided
as an input to a convolutional network which provides a convolutional feature
map. Instead of using selectivesearch algorithm on the feature map to identify
the region proposals, a separate network (RPN - Region Proposal Network) is
used to predict the region proposals. The predicted region proposals are then
reshaped using a RoI pooling layer which is then used to classify the image
within the proposed region and predict the offset values for the bounding boxes.
Faster RCNN is much faster than its predecessors; hence it can even be
used for real-time object detection.
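
For reference, torchvision ships a pretrained Faster R-CNN; a minimal inference sketch (illustrative usage of the COCO-trained model, separate from the custom training code above):

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
img = torch.rand(3, 480, 640) # placeholder; use a real image tensor scaled to [0, 1].
with torch.no_grad():
    predictions = model([img]) # a list of images in, a list of dicts out.
# each dict holds 'boxes' (x_min, y_min, x_max, y_max), 'labels' & 'scores'.
print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])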

Anchor Boxes:

Anchor Boxes are a handy replacement for selectivesearch that was used to
compute RPs.
When we have a decent idea of the width, height & aspect ratio
(height/width) of the objects in our dataset (from the provided ground truth BBs),
we define the anchor boxes with heights and widths representing the majority
of the objects' bounding boxes within our dataset (these can be obtained by
employing K-means clustering on top of the ground truth bounding boxes; see
the sketch after the usage steps below).

Usage:
(a) Slide each anchor box over an image from top left to bottom right.
(b) The anchor box that has a high (above a threshold) intersection over union
(IoU) with the object will have a label that mentions that it contains an object, and
the others will be labeled 0.

Once we obtain the ground truths as defined here, we can build a model that can
predict the location of an object and also the offset corresponding to the anchor
box to match it with ground truth.
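
A hedged sketch of deriving anchor shapes by clustering ground-truth box sizes (plain k-means on (width, height) via scipy here; YOLO-style pipelines often cluster with an IoU-based distance instead):

import numpy as np
from scipy.cluster.vq import kmeans

# (width, height) of ground-truth boxes; the values below are illustrative.
wh = np.array([[30, 60], [34, 62], [120, 80], [118, 90], [60, 60], [64, 58]], dtype=float)
anchors, _ = kmeans(wh, 3) # 3 anchor shapes (width, height), one per cluster.
print(np.round(anchors))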

Region Proposal Network (RPN):

RPN leverages anchor boxes to come up with predictions of RPs.

For each stride/slide of an anchor box in the image, we feed the image
crop (crop a sub-image at, & equal to anchor box) to the RPN, indicating whether
the crop contains an object.
Essentially, an RPN suggests the likelihood of a crop containing an object.

While region proposal generation based on selectivesearch is done
outside of the neural network, we can build an RPN that is a part of the object
detection network. Using an RPN, we no longer have to perform unnecessary
computations to calculate region proposals outside of the network. This way,
we have a single model that identifies the regions, identifies the classes of the
objects in the image, and identifies their corresponding bounding box locations.

We take each region candidate and compare it with the ground-truth bounding boxes of objects in an image to identify whether the IoU between the region candidate and a ground-truth bounding box is greater than a certain threshold. If the IoU is greater than a certain threshold (say, 0.5), the region candidate contains an object; if the IoU is less than a lower threshold (say, 0.1), the region candidate does not contain an object; and all candidates with an IoU between the two thresholds (0.1 - 0.5) are ignored during training.
Once we train a model to predict whether a region candidate contains an object, we then perform non-max suppression, as multiple overlapping regions can contain the same object.
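As a concrete illustration of these two steps, here is a minimal sketch using torchvision's box_iou and nms ops (the box coordinates and scores are made-up values):

import torch
from torchvision.ops import box_iou, nms

# Hypothetical boxes in (x1, y1, x2, y2) pixel coordinates.
candidate = torch.tensor([[50., 50., 150., 150.]])
ground_truth = torch.tensor([[60., 60., 160., 160.]])
print(box_iou(candidate, ground_truth)) # ~0.68 -> above 0.5, so labeled as containing an object.

# NMS: keep the highest-scoring box & suppress overlapping boxes (IoU > 0.5 here).
boxes = torch.tensor([[50., 50., 150., 150.], [55., 55., 155., 155.]])
scores = torch.tensor([0.9, 0.8])
print(nms(boxes, scores, iou_threshold=0.5)) # tensor([0]) - the lower-scoring overlapping box is dropped.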

Explanation of how RPN uses Anchor Boxes:

Below is a sketch that shows how a 3x3 sliding window (red colour) of the
RPN is applied on some location (blue dot) of a feature map with 512 channels.

For each feature-map location (blue dot) there is an associated set of k anchor boxes of fixed scales and aspect ratios. The regression head of the RPN outputs 4 values (tx, ty, tw, th) for each anchor box, which are then used to resize and move the center of each anchor box to get a region proposal (together with an objectness score obtained by the classification branch (softmax classifier) of the RPN).
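For reference, in the Faster R-CNN paper these offsets are parameterized relative to the anchor (xa, ya, wa, ha) as:

tx = (x - xa) / wa,  ty = (y - ya) / ha,  tw = log(w / wa),  th = log(h / ha)

so at inference, a proposal (x, y, w, h) is recovered from the predicted offsets as x = xa + tx*wa, y = ya + ty*ha, w = wa*exp(tw), h = ha*exp(th).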

RPNs use anchor boxes that serve as references at multiple scales and
aspect ratios. The scheme can be thought of as a pyramid of regression
references, which avoids enumerating images or filters of multiple scales or
aspect ratios.

Region Of Interest (RoI) Pooling:

Region of interest pooling (also known as RoI pooling) is an operation widely


used in object detection tasks using convolutional neural networks. For example,
to detect multiple cars and pedestrians in a single image. Its purpose is to
perform max pooling on inputs of nonuniform sizes to obtain fixed-size feature
maps (e.g. output of 7×7 from variable input sizes).

The layer takes two inputs:


- A fixed-size feature map obtained from a deep convolutional network
with several convolutions and max pooling layers.
- An N x 5 matrix representing a list of regions of interest, where N is a
number of RoIs. The first column represents the image index and the remaining
four are the coordinates of the top left and bottom right corners of the region.

For every region of interest from the input list, it takes a section of the input
feature map that corresponds to it and scales it to some predefined (output) size
(e.g., 7×7).

The scaling is done by:


- Dividing the region proposal into equal-sized sections (the number of
which is the same as the dimension of the output).
- Finding the largest value in each section.
- Copying these max values to the output buffer.

Thus, the output buffer for each RP (always of the same size) will contain the max value of each section.
The major hurdle in going from image classification to object detection is the fixed-size input requirement of the network, due to the existing fully connected layers. In object detection, each proposal will be of a different shape, so all proposals need to be converted to the fixed shape required by the fully connected layers. RoI Pooling does exactly this.

Idea of RoI Pooling - performing region of interest pooling on a single 8×8 feature map, with one region of interest and an output size of 2×2 (this process is also called quantization). We divide the RoI into 2×2 sections (black lines), because the output size is 2×2. Notice that the size of the region of interest doesn't have to be perfectly divisible by the number of pooling sections (in this case our RoI (black rectangle) is 7×5 and we have 2×2 pooling sections).

The result of RoI Pooling is that, from a list of rectangles with different sizes, we can quickly get a list of corresponding feature maps with a fixed size. Note that the dimension of the RoI pooling output doesn't actually depend on the size of the input feature map, nor on the size of the region proposals. It's determined solely by the number of sections we divide the proposal into.
What’s the benefit of RoI pooling? One of them is processing speed. If there are
multiple object proposals on the frame (and usually there’ll be a lot of them), we
can still use the same input feature map for all of them. Since computing the
convolutions at early stages of processing is very expensive, this approach can
save us a lot of time.

What are the most important things to remember about RoI Pooling?

- It’s used for object detection tasks.


- It allows us to reuse the feature map from the convolutional network.
- It can significantly speed up both train and test time.
- It allows us to train object detection systems in an end-to-end manner.
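A minimal sketch of the operation using torchvision's built-in RoI pooling op (the feature map is random and the single RoI is a made-up box):

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 8, 8) # batch of 1, 512 channels, 8x8 spatial size.
# One RoI in (batch_index, x1, y1, x2, y2) format - one row of the N x 5 matrix described above.
rois = torch.tensor([[0., 0., 3., 6., 7.]])
# Pool the RoI to a fixed 2x2 output, regardless of the RoI's own size.
pooled = roi_pool(feature_map, rois, output_size=(2, 2), spatial_scale=1.0)
print(pooled.shape) # torch.Size([1, 512, 2, 2])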

Working of an (Faster-RCNN) Object detection Model.

Faster-RCNN contains the following submodules:


- GeneralizedRCNNTransform is a simple resize followed by a normalize
transformation.
- BackboneWithFPN is a neural network that transforms input into a feature map.
- RegionProposalNetwork generates the anchor boxes for the preceding feature
map and predicts individual feature maps for classification and regression tasks.

- RoIHeads takes the preceding maps, aligns them using RoI pooling, processes
them, and returns classification probabilities for each proposal and the
corresponding offsets.

2) Example code for Faster-RCNN object detection on Bus-Truck:

Dataset:

import torch
from torch.utils.data import Dataset
from torchvision import transforms
import cv2, pandas as pd
import numpy as np
from torchvision.ops import nms # for non-maximum suppression.

device = "cuda" if torch.cuda.is_available() else "cpu"

class BusTruckDataset(Dataset):
resize_value = 224
labels = [] #["Background", "Bus", "Truck"]
preprocess = transforms.Compose([transforms.ToTensor(),
transforms.Resize(resize_value), transforms.CenterCrop(resize_value),
transforms.ConvertImageDtype(torch.float32)])

def __init__(self, images_path, dataset_path, Rows, train) -> None:


super().__init__()
self.image_path = images_path # path that points to the actual images.
self.data = []
# prepare the data to be loaded.
self.prepare_data(dataset_path, Rows, train)

def __len__(self):
return len(self.data)

    def __getitem__(self, index):
        image_name, target = self.data[index]
        # read the entire image & preprocess it.
        img = cv2.imread(self.image_path + image_name + ".jpg")
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)/255 # reorder channels & convert values to the range [0,1].
        img = BusTruckDataset.preprocess(img) # perform preprocessing.
        # return the image & its target.
        return img.to(device), target

    def collate_fn(self, batch):
        # '*' is the 'splat' operator, used for unpacking a list into arguments.
        # For example: foo(*[1, 2, 3]) is the same as foo(1, 2, 3).
        # zip() takes n iterables and returns tuples, where the yth tuple contains the yth
        # element of each of the iterables provided. Ex: zip(['a', 'b', 'c'], [1, 2, 3]) ->
        # ('a', 1) ('b', 2) ('c', 3).
        # This reorganizes the batch into a tuple of images and a tuple of targets, instead of
        # letting the DataLoader try to stack the variable-sized targets into single tensors.
        return tuple(zip(*batch))

    # gets the unique classes/labels from the dataset.
    def get_output_classes(df):
        labelCols = df["LabelName"]
        output_classes = labelCols.unique().tolist()
        output_classes.insert(0, "Background") # set "Background" as the first class.
        return output_classes

    # prepares data to be consumed by this Dataset. prepare_data() is somewhat
    # specific to the data format provided in the csv dataset file.
    def prepare_data(self, input_csv_file, Rows, train):
        # 24062 rows in total, in the csv dataset.
        if train:
            # for training, use the first 'Rows' rows.
            df = pd.read_csv(input_csv_file, nrows=Rows)
        else:
            # for validation, use the last 'Rows' rows.
            df = pd.read_csv(input_csv_file)
            df = df.tail(Rows)
        # get the output label classes.
        BusTruckDataset.labels = BusTruckDataset.get_output_classes(df)

        # get the unique images; we will later iterate over them to get all entries for each unique image.
        self.unique_images = df["ImageID"].unique()

        # Start extracting data & presenting it in the desired form.
        for image_id in self.unique_images:
            # get all rows for "image_id" in the dataset.
            data = df[df["ImageID"] == image_id]

            labels = data['LabelName'].values.tolist() # [<label>, <label>, <label>, ...]
            BBs = data[['XMin','YMin','XMax','YMax']].values # [array[x1,y1,x2,y2], array[x1,y1,x2,y2], ...]
            # convert to image dimensions. The BB prediction output from the model will also be
            # in absolute values. Better still would be to multiply by the actual image's width &
            # height values, instead of 'resize_value' (224).
            BBs[:,[0,2]] *= BusTruckDataset.resize_value
            BBs[:,[1,3]] *= BusTruckDataset.resize_value
            boxes = BBs.astype(np.uint32).tolist() # convert to whole numbers.
            # create a dictionary, as torch FRCNN expects ground truths as a dictionary of tensors.
            target = {}
            target["boxes"] = torch.Tensor(boxes).float().to(device) # 2D tensor of dims (<num of BBs>, 4)
            # if a single image contains multiple objects, "labels" will have multiple elements in the
            # "target" dict. This is handled in the collate function, as pytorch otherwise requires
            # tensors within a batch to be of the same size.
            target["labels"] = torch.Tensor([BusTruckDataset.labels.index(labelName) for labelName in labels]).long().to(device) # 1D tensor of labels corresponding to the BBs in target["boxes"].
            self.data.append((image_id, target)) # append the image name & its data as a tuple, in self.data.
            #print(len(self.data)) # length should equal the length of "self.unique_images".

# ---------------------- Testing/Predicting on input image --------------------

    def load_for_testing(input_filename, row_number):
        # (1) load & preprocess the input.
        df = pd.read_csv(input_filename)
        BusTruckDataset.labels = BusTruckDataset.get_output_classes(df)
        row = df.iloc[row_number - 1]
        input_image = cv2.imread("../dataset/images/" + str(row["ImageID"]) + ".jpg")
        width = input_image.shape[1]
        height = input_image.shape[0]
        input_image = cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)/255
        input_image = BusTruckDataset.preprocess(input_image)
        gt_bb = torch.tensor([row["XMin"], row["YMin"], row["XMax"], row["YMax"]])
        gt_bb *= torch.tensor([width, height, width, height]) # scale the normalized BB to pixel coordinates.
        return input_image, row["LabelName"], gt_bb

# The output of the trained Faster-RCNN model contains boxes, labels, and scores
corresponding to classes.
def decode_outputs(outputs):
# extract bounding boxes from outputs structure.
outputs = outputs[0]
bbs = outputs["boxes"].cpu().detach()
# extract label names.
labels = outputs["labels"].cpu()
# extract confidence scores.
scores = outputs["scores"].cpu().detach()
nms_indices = nms(bbs, scores, 0.05)
# retain only those values that correspond to nms_indices.
bbs = bbs[nms_indices]
labels = labels[nms_indices]
scores = scores[nms_indices]
return bbs.tolist(), labels.tolist(), scores.tolist()

Model:

import torch
from torchvision import models
import torch.nn as nn
import torchvision

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor # FRCNN predictor head. torchvision provides a Faster R-CNN model pre-trained on the COCO dataset using a ResNet50 backbone.

# https://pytorch.org/vision/main/models/generated/torchvision.models.detection.fasterrcnn_resnet50_fpn.html

"""
Faster-RCNN contains the following submodules:
- GeneralizedRCNNTransform is a simple resize followed by a normalize transformation.
GeneralizedRCNNTransform(
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
Resize(min_size=(800,), max_size=1333, mode='bilinear')
)
- BackboneWithFPN is a neural network that transforms input into a feature map.
- RegionProposalNetwork generates the anchor boxes for the preceding feature map and
predicts individual feature maps for classification and regression tasks.
RegionProposalNetwork(
AnchorGenerator()
RPNHead(
conv(): (Conv2D(256, 256, kernel_size(3,3)), stride=(1,1), padding=(1,1))
cls_logits(): (Conv2D(256, 3, kernel_size(1,1)), stride=(1,1))
bbox_pred(): (Conv2D(256, 12, kernel_size(1,1)), stride=(1,1))
)
)
- RoIHeads takes the preceding maps, aligns them using RoI pooling, processes them,
and returns classification probabilities for each proposal and the corresponding
offsets.
RoIHeads(
(box _ roi _pool): MultiScaleRolAlign()
(box head): TwoMLPHead( (fc6): Linear(infeatures=12544, out features=1024,
bias=True)
(fc7): Linear(in_features=1024, out_features=1024, bias=True)
)
(box_predictor): FastRCNNPredictor(
(cls score): Linear(in features=1024, out features=2, bias=True)
(bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
)
)
"""

class FRCNNModel(nn.Module):
    def __init__(self, num_classes) -> None:
        super().__init__()
        self.model_name = "trained_models/ObjDetectModel_FRCNN.pth"
        self.model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
        # disable learning on all layers.
        #summary(model, torch.zeros(1, 3, 224, 224))
        in_features = self.model.roi_heads.box_predictor.cls_score.in_features
        # The predictor is what outputs the classes and the corresponding bboxes.
        # num_classes = number of classes + 1 (for background).
        self.model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    def forward(self, input, targets=None):
        #class_predictions, bb_predictions = self.model(input)
        if targets is None:
            losses = self.model(input)
        else:
            losses = self.model(input, targets)
        return losses

train-validate:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
import math

device = "cuda" if torch.cuda.is_available() else "cpu"

def train(epochs, model:nn.Module, dLoader:DataLoader, optimFn:optim):


model.train()
for epoch in range(epochs):
num_iterations = 0

total_loc_loss = 0; total_regr_loss = 0; total_loss_objectness = 0;
total_loss_rpn_box_reg = 0
total_loss = 0
for _, [input_images, targets] in enumerate(dLoader):
optimFn.zero_grad()
losses = model(input_images, targets) # 'losses' is a dictionary of
losses.
# compute total loss.
loss = sum(loss for loss in losses.values()) # can also customize how
individual losses are assigned weightage.

loss.backward()
optimFn.step()
num_iterations += 1
loc_loss, regr_loss, loss_objectness, loss_rpn_box_reg = [losses[k] for k
in ['loss_classifier','loss_box_reg','loss_objectness','loss_rpn_box_reg']]
# accumulate individual losses over iterations.
total_loc_loss += loc_loss.item()
total_regr_loss += regr_loss.item()
total_loss_objectness += loss_objectness.item()
total_loss_rpn_box_reg += loss_rpn_box_reg.item()
total_loss += loss.item()
print(f"Epoch:{epoch+1} - Training: Location Loss =
{(total_loc_loss/num_iterations):0.5f}, Regression Loss =
{total_regr_loss/num_iterations:0.5f}, Objectness Loss =
{total_loss_objectness/num_iterations:0.5f}, RPN box Regression Loss =
{total_loss_rpn_box_reg/num_iterations:0.5f}, Total loss =
{total_loss/num_iterations:0.5f}")

def validate(model:nn.Module, dLoader:DataLoader, optimFn:optim):


#model.eval()
model.train() # NOTE: to get losses, we need to use train() mode only. This means
that model outputs different data in train() & eval() mode respectively.
with torch.no_grad():
for _, [input_images, targets] in enumerate(dLoader):
optimFn.zero_grad()
losses = model(input_images, targets)
loss = sum(loss for loss in losses.values())

loc_loss, regr_loss, loss_objectness, loss_rpn_box_reg = [losses[k] for k
in ['loss_classifier','loss_box_reg','loss_objectness','loss_rpn_box_reg']]
#total_iterations += 1
print(f"Validation per Batch Loss: Location Loss =
{(loc_loss.item()):0.5f}, Regression Loss = {regr_loss.item():0.5f}, Objectness Loss =
{loss_objectness.item():0.5f}, RPN box Regression Loss =
{loss_rpn_box_reg.item():0.5f}, Total loss = {loss.item():0.5f}")
print("\n")

main:

import torch
from BusTruckDataset import BusTruckDataset
from torch.utils.data import DataLoader
from FRCNNModel import FRCNNModel
from train_validate import train, validate
import torch.optim as optim
import numpy as np
import cv2

# link: https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/blob/
master/Chapter08/Training_Faster_RCNN.ipynb

device = "cuda" if torch.cuda.is_available() else "cpu"

ROWS = 3000 #1000 # number of rows to use from the dataset, since the dataset is huge & generating region proposals adds significant time even for a small amount of data.

def run_training(epochs = 20):


dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, collate_fn=dataset.collate_fn,
shuffle=True, drop_last=True)
model = FRCNNModel(len(dataset.labels)).to(device)
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")

print("------------Training Started-----------------")
train(epochs, model, dLoader, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")

def run_validation():
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", int(ROWS/5),
False)
dLoader = DataLoader(dataset, batch_size=100, collate_fn=dataset.collate_fn,
shuffle=False, drop_last=True)
model = FRCNNModel(len(dataset.labels)).to(device)
optimFn = optim.Adam(model.parameters(), lr=0.0005)
result = model.load_state_dict(torch.load(model.model_name))
print("\n")
print("------------Validation Started-----------------")
validate(model, dLoader, optimFn)
print("------------Validation Complete-----------------")
print("\n")

# to resume training.
def resume_training(epochs = 20):
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, collate_fn=dataset.collate_fn,
shuffle=True, drop_last=True)
model = FRCNNModel(len(dataset.labels)).to(device)
# load previously trained model
result = model.load_state_dict(torch.load(model.model_name))
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")

print("------------Training Resumed-----------------")
train(epochs, model, dLoader, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs,
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")

# converts a tensor to opencv image format (numpy array) by rearranging dimensions before conversion.
def convert_tensor_image_to_opencv(tensor_img):
    # 'tensor_img' format = CxHxW (if using transforms.ToTensor()).
    tensor_img = torch.permute(tensor_img, dims=(1,2,0))
    tensor_img *= 255
    return tensor_img.numpy().astype(np.uint8).copy() # format = HxWxC, as required by cv2 images.

# test_on_input() performs prediction on a single input image.
def test_on_input(input_filename, row_number):
    # (1) get the original input image, its ground-truth label & ground-truth BB.
    input_image, gt_label, gt_bb = BusTruckDataset.load_for_testing(input_filename, row_number)

# (2) load model.


model = FRCNNModel(len(BusTruckDataset.labels)).to(device)
# load previously trained model.
result = model.load_state_dict(torch.load(model.model_name))
model.eval()
# perform prediction. BB prediction output from model, will be in absolute (pixel)
values.
input_image_batch = torch.unsqueeze(input_image, 0) # add batch dimension.
outputs = model(input_image_batch)

bbs, labels, scores = BusTruckDataset.decode_outputs(outputs)

# convert tensor to opencv image for drawing BB & showing it in window.

input_image = convert_tensor_image_to_opencv(input_image)

print(f"GT label: {gt_label}\n")

color = (0,255,255) # yellow color for best index.


best_predicted_label = None
for index, bb in enumerate(bbs):
# get label name for predicted class number.
best_predicted_label = BusTruckDataset.labels[labels[index]]
# get score for this predicted label.
predicted_label_score = scores[index]
# Testing only - print prediction info.
print(f"Best Class label: {best_predicted_label} , Score:
{predicted_label_score}")
# draw text on image.
input_image = draw_bb_on_image(input_image, bb, color, best_predicted_label,
predicted_label_score)
color = (0,0,255) # red color for rest of the predicted BBs.

# draw actual ground truth BB (green color) on image.


input_image = draw_bb_on_image(input_image, gt_bb, (0,255,0), gt_label, 1)

cv2.imshow("window1", input_image)
cv2.waitKey(0)
print("Prediction completed.")

# draws a rectangle (+ label text) on the image & returns the image.
def draw_bb_on_image(img, bb, color, best_predicted_label, predicted_label_score):
    # clamp the BB coordinates to the valid image area.
    startX = max(int(bb[0]), 0)
    startY = max(int(bb[1]), 0)
    startPoint = (startX, startY)
    endX = min(int(bb[2]), BusTruckDataset.resize_value)
    endY = min(int(bb[3]), BusTruckDataset.resize_value)
    endPoint = (endX, endY)

    # draw the BB rectangle.
    img = cv2.rectangle(img, startPoint, endPoint, color, 1)
    # display both the label & its predicted probability/score above the BB in the image.
    if predicted_label_score == 1: # for the ground truth BB, do not show the score.
        label_description = best_predicted_label
    else:
        label_description = best_predicted_label + " (" + f"{predicted_label_score:0.3}" + ")"
    img = cv2.putText(img, label_description, (startX, startY-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    return img

#run_training(10)

#resume_training(10)

#run_validation()

test_on_input("../dataset/df.csv", 9000) # extract the 9000th row from csv for


testing.
#test_on_input("../dataset/df.csv", 9102)
#test_on_input("../dataset/df.csv", 9202)
#test_on_input("../dataset/df.csv", 2000)

# images up to 1000 are already in the training set. Detection performs well on training images.
#test_on_input("../dataset/df.csv", 500)
#test_on_input("../dataset/df.csv", 600)

YOLO (You Only Look Once):

(I) Advantages of YOLO over Faster R-CNNs:

1) Faster R-CNN works on the concept of sliding anchors over the image, so it is possible that generated regions do not fully encompass the object in the image, and the model has to guess the real BBs.
YOLO looks at the whole image while predicting the BB.

Faster R-CNN performs detection on various region proposals and ends up performing predictions multiple times for various regions of an image. The YOLO architecture, on the other hand, is more like a fully convolutional neural network (FCNN): the image passes through the FCNN once, and the output gives the prediction.

2) Faster R-CNN is still slower, as we have 2 networks: the RPN & the final network for prediction.
YOLO has a single network that looks at the whole image at once & makes predictions in real time.
Faster R-CNN produces regions of interest (RoIs) to perform further processing on, while YOLO does detection and classification at the same time.

(II) Output data format:

(1) Residual Blocks: Divide the input image into an NxN grid (say N=3) of equal-sized cells (for YOLOv8, the default cell size seems to be (32,32) - see the source code in ultralytics/yolo/engine/trainer.py).
YOLO performance depends on the grid size.

(2) Bounding Box Regression:
(a) Identify those grid cells that contain the center of at least one ground
truth bounding box. In our case, they are cells b1 and b3 of our 3 x 3 grid image.
The cell(s) where the middle point of the ground truth bounding box falls
is/are responsible for predicting the bounding box of the object.
(b) The output ground truth corresponding to a cell is as
follows(considering there are 3 classes - c1=truck, c2=car, c3=bus):

Sample cell b1. Output format.

Note that each cell size is normalized/scaled in range from 0-1.

pc - (the objectness score) is the probability of the cell containing an object (any
of the classes).

A background class is not needed here, as pc=0 means the cell does not contain any of the objects/classes from the training data, so the rest of the values are meaningless.
bx, by - the location (offset) of the midpoint of the ground truth bounding box with
respect to the grid cell origin ((0.5, 0.5) in above grid cell image; as the midpoint
of the ground truth is at a distance of 0.5 units from the origin, from both X & Y
axis).
bw, bh - bw is the ratio of the width of the bounding box with respect to the width
of the grid cell (bw = bbW / gW in below image for cell b3). bh is the ratio of the
height of the bounding box with respect to the height of the grid cell (bh = bbH /
gH).

Grid cell b3.


c1,c2,c3 - probabilities of the cell containing a particular class in our example
classes (truck, car, bus).

Thus, the number of output nodes per cell is {5 + <number of classes/labels>}, i.e. {5 + 3} = 8 for the 3 classes in our example. For NxN cells, we get {N x N x 8} outputs.
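A minimal sketch of building such a ground-truth target tensor for a 3x3 grid and 3 classes, following the [pc, bx, by, bw, bh, c1, c2, c3] layout above (the box values are made up):

import torch

N, num_classes = 3, 3
target = torch.zeros(N, N, 5 + num_classes) # 3 x 3 x 8 outputs.

# Hypothetical ground truth: a car (class index 1) whose BB midpoint falls in cell
# (row=0, col=0) at offset (0.5, 0.5) from the cell origin, with a box 1.2x the cell
# width and 0.9x the cell height.
target[0, 0, :5] = torch.tensor([1.0, 0.5, 0.5, 1.2, 0.9]) # [pc, bx, by, bw, bh]
target[0, 0, 5 + 1] = 1.0 # one-hot class: c2 (car) = 1.
print(target.shape) # torch.Size([3, 3, 8])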
Example output for cell b1 - with the class in cell b1 being a car, c2 is 1, while c1 and c3 are 0.

(3) IoU: a single object in an image can have multiple grid box
candidates for prediction (see anchors below), even though not all of them are
relevant. The goal of the IOU is to discard such grid boxes to only keep those
that are relevant.
Training Stage:
During the training process, YOLO calculates IoU between the predicted bounding boxes and
the ground truth bounding boxes to determine the quality of the predictions. IoU is computed
by dividing the area of intersection between the predicted and ground truth bounding boxes by
the area of their union. This gives a measure of how well the predicted box aligns with the
actual object location.
If the IoU is above a certain threshold (usually around 0.5 or 0.6), the predicted bounding box is
considered a "true positive" match for the corresponding object. It means the algorithm
successfully detected the object accurately.
If the IoU is below the threshold, the predicted bounding box is treated as a "false positive" or a
"false negative" depending on whether there is a ground truth object in that location. This
provides information for the algorithm to improve its predictions.

Inference Stage:
When using a trained YOLO model for object detection on new images, the algorithm predicts bounding boxes for potential objects (each grid cell's output will have a prediction for each class with some confidence score). These predictions are then filtered using a process called non-maximum suppression (NMS), which is based on IoU.
For each class, the predicted bounding boxes are sorted by their confidence scores (the
probability that the predicted box contains an object of that class). Starting from the box with the
highest confidence, YOLO compares the IoU of this box with the IoU of all subsequent boxes
for the same class.
If the IoU between two (predicted) boxes is above a certain threshold, the box with the lower
confidence score is suppressed (removed). This prevents multiple bounding boxes from
being detected for the same object.
This process is repeated until all boxes have been compared and potentially suppressed.

By using IoU in both training and inference, YOLO ensures that it detects objects accurately
and produces a single, well-aligned bounding box for each object, even in cases where objects
might overlap or be close to each other.

(4) NMS: Setting a threshold for the IOU is not always enough because
an object can have multiple boxes with IOU beyond the threshold, and leaving all
those boxes might include noise. Here is where we can use NMS to keep only
the boxes with the highest probability score of detection.

NOTE: There can be scenarios where there are multiple objects within the same grid cell.
Ex:

In the preceding example, the midpoint of the ground truth bounding boxes for both the car and the person fall in the same cell - cell b1.

Anchor boxes come in handy in such a scenario. Let's say we have two anchor
boxes – one that has a greater height than width (corresponding to the person)
and another that has a greater width than height (corresponding to the car).
The output for each cell in a scenario where we have two anchor boxes is
represented as a concatenation of the output expected of the two anchor boxes:

Here, bx, by, bw, and bh represent the offset from the anchor box (rather than from the grid cell) in this scenario. From the preceding screenshot, we see we have an output that is 3 x 3 x 16, as we have two anchors. The expected output is of the shape N x N x (5 + num_classes) x num_anchor_boxes, where N x N is the number of cells in the grid, num_classes is the number of classes in the dataset, and num_anchor_boxes is the number of anchor boxes (here: 3 x 3 x (5 + 3) x 2 = 3 x 3 x 16).

Architecture:

YOLO (v1) has overall 24 convolutional layers, four max-pooling layers, and two
fully connected layers.

YOLO v8:

Link: https://docs.ultralytics.com/#ultralytics-yolov8

One key feature of YOLOv8 is its extensibility. It is designed as a framework that supports all previous versions of YOLO, making it easy to switch between different versions and compare their performance. This makes YOLOv8 an ideal choice for users who want to take advantage of the latest YOLO technology while still being able to use their existing YOLO models.

YOLOv8 is an anchor-free model. This means it directly predicts the center of an object, instead of the offset from a known anchor box. Anchor-free detection reduces the number of box predictions, which speeds up Non-Maximum Suppression (NMS), a complicated post-processing step that sifts through candidate detections after inference.

In anchor-free detectors, the predictions per grid cell are reduced to 1, as only the grid cell (and no anchors inside it) is responsible for BB prediction. We can also increase the number of grid cells to accommodate multiple detections in a (previously) larger cell, say for small objects that are very close to each other.

The initial success of the anchor-based method led to more research, which made it more accurate than anchor-free object detection. However, recent progress in anchor-free object detection has made it equivalent to the anchor-based method in terms of accuracy.

Following are a few advantages of anchor free methods over anchor-based:

(1) Finding suitable anchor boxes (in shape and size) is crucial in training an
excellent anchor-based object detection model. Finding suitable anchors is a
complex problem and may need hyper-parameter tuning.
(2) Using more anchors results in better accuracy in anchor-based object
detection but using more anchors comes at a cost. The model needs more
complex architecture, which leads to slower inference speed.
(3) Anchor free object detection is more generalizable. It predicts objects as
points that can easily be extended to key-points detection, 3D object detection,
etc. However, the anchor-based object detection solution approach is limited to
bounding box prediction.

YOLO dataset format:


Link: https://blog.paperspace.com/train-yolov5-custom-data/

YOLO expects the input dataset to have 2 sub-folders: "images" & "labels". "images" contains the (N) input images, and "labels" contains N (.txt) files, where each line of a text file (having the same name as the input image it corresponds to) describes a bounding box.
Ex:

The annotation file for the image above contains one line per object, as described below.

There are 3 objects in total (2 persons and one tie) in the above input image. Each line represents one of these objects. The specification for each line is as follows:
- One row per object.
- Each row is in <class> <x_center> <y_center> <width> <height> format.
- Box coordinates must be normalized by the dimensions of the image (i.e. have values between 0 and 1). Note that the x_center, y_center, width, height values are w.r.t. the input image's normalized coordinates.
- Class numbers are zero-indexed (start from 0).
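Putting these rules together, a hypothetical label file for such an image (2 persons = class 0, one tie = class 1; the numbers are illustrative, not the actual annotation values) would look like:

0 0.48 0.63 0.69 0.71
0 0.74 0.52 0.31 0.93
1 0.36 0.79 0.07 0.40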

All the above information is to be provided to the model during training, via a
single YAML file (say data.yaml), that has the following information in below
format:
train: <path to training dataset ex: ../../dataset/yolo_data/train/images>
val: <path to validation dataset ex: ../../dataset/yolo_data/val/images>
test: <path to testing data>

# number of classes
nc: <N>

# class names
names: <example:["Bus","Truck"]>

YOLO expects to find the training/validation labels for the images in the folder
whose name can be derived by replacing “images” with “labels” in the path to
dataset images.
For example, if the input training data path is { /dataset/train/images }, YOLO will
look for training labels in { /dataset/train/labels }.

Installation:
pip install ultralytics (installs everything needed for Yolo-v8)

Usage:

(a) Prediction/Detection code example - from Python - using pretrained model:


Link: https://docs.ultralytics.com/tasks/detect/
Ex:

from ultralytics import YOLO
import cv2

model = YOLO(<pretrained model name, say 'yolov8n.pt'>)

results = model.predict(<input image name>) # for detection. Can also simply use model(<input image name>).
# results structure explanation: https://docs.ultralytics.com/modes/predict/

res_plotted = results[0].plot() # plot the results (BBs, masks, classification logits, etc.) on the image.
cv2.imshow("window1", res_plotted)
cv2.waitKey(0)

Alternate forms of input (such as a numpy array or tensor, when providing the input image from memory), along with their shape format & input value range, are described on YOLO's website under the "Predict" section.

Output data format:

# detections (during prediction) are outputted in the "runs" (runs/detect/predict) folder.

# get individual outputs from "result". result.orig_img = the original image loaded in memory;
# result.orig_shape = the original image shape. "results" can have multiple entries if there are
# multiple inputs, e.g. a video or a directory.
for result in results:
    # "boxes" contains (1) boxes (a 2D tensor of dim (N,6)), (2) "cls" (tensor of size N),
    # (3) "conf" (tensor of size N), for all (N) objects detected in the input image.
    # Boxes shape: (N,6) => (4 BB values, conf, cls); format = [x1, y1, x2, y2, score, label].
    # boxes.xywh|xyxy|xyxyn|... (n = normalized).
    boxes = result.boxes
    all_boxes = boxes.boxes # contains only the (N,6) values, no separate "cls" or "conf" fields.
    masks = result.masks # masks.segments (bounding coordinates of masks), masks.data (raw masks tensor).
    probs = result.probs # probabilities for classification tasks.

YOLOv8 can also be used for classification, segmentation & pose/keypoint estimation.

YOLO also allows exporting its model in other formats like ONNX, etc.

(b) Training / Validation code:

Link: https://docs.ultralytics.com/modes/train/
Ex:

model = YOLO('yolov8n.yaml') # build a new model from YAML (for training only). This
yaml is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/v8/
yolov8.yaml.
OR
model = YOLO(<pretrained model name/path, say 'yolov8n.pt'>) # transfer
learning.
model.train(data='coco128.yaml', epochs=100, imgsz=640) # Train the model.
# for other training parameters info that can be provided, see
https://github.com/ultralytics/ultralytics/blob/main/ultralytics/yolo/cfg/default.yaml OR
https://docs.ultralytics.com/modes/train/#arguments. Can also specify “device” arg here, to run
training on GPU (or multiple GPU ex: device='0,1') or cpu(device='cpu'), batch, image
size(imgsz), optimizer, learning rate, etc.

Training outputs are available at the path 'runs/detect/train/weights/best.pt', relative to the file containing the training code.

(Precision-Recall curve figure: the illustrated green curve is best, the red curve is inferior; the other curves are plots of actual training data.)

There is always a tradeoff between Precision & Recall, because, if we keep the
threshold tight, Precision will be high, but Recall will be low. If we are liberal with
the threshold, Recall will be high, but Precision suffers.
A high area under the curve represents both high recall and high precision,
where high precision relates to a low false positive rate, and high recall relates to
a low false negative rate.

The Precision-Recall curve is a good way to evaluate the performance of an object detector as the confidence threshold is changed, by plotting a curve for each object class. An object detector of a particular class is considered good if its precision stays high as recall increases, which means that if you vary the confidence threshold, the precision and recall will still be high. Another way to identify a good object detector is to look for a detector that identifies only relevant objects (0 False Positives = high precision) while finding all ground truth objects (0 False Negatives = high recall).

A poor object detector needs to increase the number of detected objects
(increasing False Positives = lower precision) in order to retrieve all ground truth
objects (high recall). That's why the Precision-Recall curve usually starts with
high precision values, decreasing as recall increases.

Another way to compare the performance of object detectors is to calculate the area under the curve (AUC) of the Precision-Recall curve. As these curves are often zigzag curves going up and down, comparing different curves (different detectors) in the same plot is usually not an easy task, because the curves tend to cross each other frequently. That's why Average Precision (AP), a numerical metric, can also help us compare different detectors. In practice, AP is the precision averaged across all recall values (over different confidence threshold values), ranging between 0 and 1 - for a single class.
AP summarizes the PR Curve to one scalar value. Average precision is
high when both precision and recall are high, and low when either of them is low
across a range of confidence threshold values.
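A minimal sketch of this AP computation for a single class, using a simple trapezoidal approximation of the area under the PR curve (the scores and match flags are made-up data; COCO-style mAP additionally averages over classes and IoU thresholds):

import numpy as np

# Hypothetical detections for one class: confidence scores, and whether each
# detection matched a ground-truth box (IoU above the threshold).
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60])
is_tp = np.array([1, 1, 0, 1, 0])
num_gt = 4 # total number of ground-truth objects of this class.

order = np.argsort(-scores) # sort detections by descending confidence.
tp_cum = np.cumsum(is_tp[order])
fp_cum = np.cumsum(1 - is_tp[order])
precision = tp_cum / (tp_cum + fp_cum)
recall = tp_cum / num_gt

ap = np.trapz(precision, recall) # area under the precision-recall curve.
print(precision, recall, ap)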

(Figure: Average Precision calculation techniques.)

YOLO v8 model architecture:

YOLO comes with a long list of architectures. Some are large and some
are small, to train on large or small datasets. Configurations can have different
backbones. There are pre-trained configurations for standard datasets.

Following are the main modules in YOLOv8:

(a) Anchors - not used in v8.

(b) Backbone - The backbone of the YOLO v4 network (such as the (DenseNet-derived) Darknet or EfficientNet) acts as the feature extraction network that computes feature maps from the input images.

(c) Neck - The neck(such as FPN or BiFPN) connects the backbone and
the head. It is composed of a spatial pyramid pooling (SPP) module and a path
aggregation network (PAN). The neck concatenates the feature maps from
different layers of the backbone network and sends them as inputs to the head.

The neck adds an SPP block after the backbone to increase the receptive field and separate out the most important features from the backbone.

(d) Head - The head processes the aggregated features and predicts/detects the bounding boxes, objectness scores, and classification scores. YOLO has 3 detection heads to predict the bounding boxes, classification scores, and objectness scores (indicating whether a desired object/class is present in the grid cell; values close to 0 mean no object in the cell, i.e. treat it as background).

NOTE:

YOLO v3 uses the concept of "Feature Pyramid Networks" (FPN). FPNs are a
CNN architecture used to detect objects at multiple scales. They construct a
pyramid of feature maps, with each level of the pyramid being used to detect
objects at a different scale. This helps to improve the detection performance on
small objects, as the model is able to see the objects at multiple scales.
Both YOLO v3 and YOLO v4 use anchor boxes with different scales and aspect
ratios to better match the size and shape of the detected objects.
YOLO v5 uses a more complex architecture called EfficientDet, based on the EfficientNet network architecture, for higher accuracy and better generalization to a wider range of object categories.
YOLO v5 uses a new method for generating the anchor boxes, called "dynamic
anchor boxes". It involves using a clustering algorithm to group the ground truth
bounding boxes into clusters and then using the centroids of the clusters as the
anchor boxes. This allows the anchor boxes to be more closely aligned with the
detected objects' size and shape.
YOLO v7 uses nine anchor boxes, which allows it to detect a wider range
of object shapes and sizes compared to previous versions, thus helping to
reduce the number of false positives.
A key improvement in YOLO v7 is the use of a new loss function called “focal
loss”. Previous versions of YOLO used a standard cross-entropy loss function,
which is known to be less effective at detecting small objects. Focal loss battles
this issue by down-weighting the loss for well-classified examples and focusing
on the hard examples - the objects that are hard to detect.

In simple words, Focal Loss (FL) is an improved version of Cross-Entropy Loss (CE) that tries to handle the class imbalance problem by assigning more weights to hard or easily misclassified examples.
YOLOv7 isn't just an object detection architecture - it provides
new model heads; that can output keypoints (skeletons) and perform instance
segmentation besides only bounding box regression, which wasn't standard with
previous YOLO models. This isn't surprising, since many object detection
architectures were repurposed for instance segmentation and keypoint detection
tasks earlier as well, due to the shared general architecture, with different outputs
depending on the task. Even though it isn't surprising, supporting instance segmentation and keypoint detection will likely become the new standard for YOLO-based models, which began outperforming practically all other two-stage detectors a couple of years ago (link: https://stackabuse.com/pose-estimation-and-keypoint-detection-with-yolov7-in-python/).
For keypoint/pose estimation in YOLOv8, see
https://docs.ultralytics.com/tasks/pose/.
Alternatively, we can use YOLO for obtaining the sub-image of detected
object(BB), then use our own custom model for pose/keypoint estimation on this
smaller sub-image.

Some block types used in YOLO architecture:


Link:
https://github.com/ultralytics/yolov5/blob/cdd804d39ff84b413bde36a84006f51769
b6043b/models/common.py?ref=blog.roboflow.com

- Concat - concatenates the outputs of previous layers (typically along the channel dimension).
- CBS(Convolution, BatchNorm, SiLU)
- C3(convolution block with 3 convolutions) - CSP(Cross Stage Partial layer) with
Bottlenecks i.e. C3 is composed of three convolution layers and a module
cascaded by various bottlenecks.
- C2f(convolution block with 2 convolutions)
- Bottleneck(two 3x3 convs with residual connections) - A bottleneck layer is a
layer that contains few nodes compared to the previous layers. It can be used to
obtain a representation of the input with reduced dimensionality.
- SPP(Spatial Pyramid Pooling) - a type of pooling layer used to reduce the
spatial resolution of the feature maps. SPP is used to improve the detection

performance on small objects, as it allows the model to see the objects at
multiple scales. Also to increase the receptive field and separate out the most
important features from the backbone. Also used to remove the fixed size
constraint of the network.
It enables the network to accept inputs of different sizes and generate fixed-
length feature representations by dividing the input feature map into regions of
different sizes and pooling features from each region separately.

Learning rate in YOLO:

The lr0 parameter is the initial learning rate, and lrf is the multiplicative factor for
the final learning rate at the last epoch of training i.e. final lr = [lr0 * lrf] - so if
lr0(starting lr) = 0.1, and lrf=0.01; then final lr = (0.1 * 0.01) = 0.001.
If lrf = 1, learning rate remains constant throughout.
By default, in YOLOv8, both lr0 and lrf have the same value (0.01), and
this value is used during training.
Regarding the cos_lr parameter, if it is set to True, then the learning rate
schedule will follow a cosine annealing pattern rather than a linear schedule. This
can lead to a smoother learning rate schedule and potentially better results. Both
lr0 and lrf are still used in the cosine annealing learning rate schedule.
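As a sketch of the two schedules (the exact curve shapes used by Ultralytics may differ; treat these formulas as illustrative assumptions):

import math

lr0, lrf, epochs = 0.01, 0.01, 100

def linear_lr(epoch):
    # multiplier interpolated linearly from 1.0 down to lrf.
    return lr0 * ((1 - epoch / epochs) * (1.0 - lrf) + lrf)

def cosine_lr(epoch):
    # cosine-annealed multiplier from 1.0 down to lrf (cos_lr=True style).
    return lr0 * (((1 + math.cos(math.pi * epoch / epochs)) / 2) * (1.0 - lrf) + lrf)

print(linear_lr(0), linear_lr(epochs)) # 0.01 -> 0.0001 (= lr0 * lrf)
print(cosine_lr(0), cosine_lr(epochs)) # same endpoints, but a smoother descent.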

3) Code for YOLOv8 on Bus-Truck dataset:

Preparing data:

import pandas as pd

from sklearn.model_selection import train_test_split
import shutil # for file copying operations.
import os

# This file has helper functions for preparing & copying Bus-Truck data in a format &
path structure, as required by YOLO for training purposes.

"""
custom training:
1) Prepare a yaml file (say data.yaml) that contains
(a) the paths to training & validation data/images (containing the input images &
label txt files(1 txt file per image - YOLO expects annotations for each image in form
of a .txt file where each line of the text file describes a bounding box)
respectively).
(b) number of output classes/labels.
(c) names of the labels.

2) create model: (model = YOLO(<pretrained weights name>)).

3) train model: model.train(<data.yaml file path & name>, epochs)


"""

# prepare label .txt file data from the csv, for each image in the Bus-Truck dataset. The labels
# follow YOLO's requirement, i.e. as many lines of
# (<class number> <BB X-center> <BB Y-center> <BB width> <BB height>) as there are BBs in a
# given image, in the label file for that image.
def prepare_BusTruck_labels():
    #images_path = "../dataset/images"
    labels = ["Bus", "Truck"]
    labels_path = "../dataset/labels/" # will contain the output label txt files for all images in "images_path".
    df = pd.read_csv("../dataset/df.csv") # load the entire dataset labels file.
    unique_images = df["ImageID"].unique().tolist() # 15225 unique images.
    print(f"Data preparation ({len(unique_images)} images) start:")
    for image in unique_images:
        rows_for_image = df[df["ImageID"] == image]
        # create a new txt file & write all rows (data) corresponding to "image" into it.
        with open(f"{labels_path}{image}.txt","x") as f:
            for _, row in rows_for_image.iterrows():
                class_number = labels.index(row["LabelName"]) # get the (0-based) class number.
                # values are start/end coordinates. Convert them to BB center point & width-height.
                x_center = (row["XMin"] + row["XMax"]) / 2
                y_center = (row["YMin"] + row["YMax"]) / 2
                width = row["XMax"] - row["XMin"]
                height = row["YMax"] - row["YMin"]
                # string containing all the data for one BB.
                row_data = f"{class_number} {x_center} {y_center} {width} {height}"
                f.write(row_data)
                f.write("\n")
    print("prepare_BusTruck_labels() completed.")

#prepare_BusTruck_labels()

# copies all input files in "ids" from source to destination.
def copy_files_to_destination(ids, srcPath, destPath, output_subpath, extension):
    # first, empty the destination directory.
    dir_name = destPath + output_subpath
    shutil.rmtree(dir_name) # remove the existing directory.
    os.mkdir(dir_name) # create a new empty directory.
    for id in ids:
        src_file_name = srcPath + id + extension
        dest_file_name = destPath + output_subpath + id + extension
        shutil.copy(src_file_name, dest_file_name)

# generate_train_validate_data() divides & copies the input images & their labels to the
# "train" & "val" folders, to be used by YOLO as training & validation data respectively.
def generate_train_validate_data():
    df = pd.read_csv("../dataset/df.csv") # load the entire dataset.
    unique_images = df["ImageID"].unique()
    # split the dataset into training (train_ids) & validation (val_ids).
    train_ids, val_ids = train_test_split(unique_images, test_size=0.1, random_state=99)
    print(f"Training data len: {len(train_ids)}, Validation data len: {len(val_ids)}")
    # path to the actual input images & their labels, all in one place.
    input_image_file_path = "../dataset/images/"
    input_label_file_path = "../dataset/labels/"
    # path where the input data is to be copied, under "train" & "val" folders.
    output_file_path = "../dataset/yolo_data/"

    # TRAINING data copy.
    # copy all input (image) files from src to the training dest (in "yolo_data").
    copy_files_to_destination(train_ids, input_image_file_path, output_file_path, "train/images/", ".jpg")
    # copy all output (label) files from src to the training dest (in "yolo_data").
    copy_files_to_destination(train_ids, input_label_file_path, output_file_path, "train/labels/", ".txt")

    # VALIDATION data copy.
    # copy all input (image) files from src to the validation dest.
    copy_files_to_destination(val_ids, input_image_file_path, output_file_path, "val/images/", ".jpg")
    # copy all output (label) files from src to the validation dest.
    copy_files_to_destination(val_ids, input_label_file_path, output_file_path, "val/labels/", ".txt")
    print("Generating Data for YOLO complete.")

#generate_train_validate_data()

data.yaml (contains training/validation paths, classes info):

train: ../../dataset/yolo_data/train/images # not sure why "../../" is needed, even though the dataset is only 1 level up from data.yaml's location (an additional "/datasets/" path is added after "code/" automatically by YOLO).
val: ../../dataset/yolo_data/val/images
#test: ../../dataset/yolo_data/test # optional.

# number of classes
nc: 2

# class names
names: ['Bus','Truck']

YOLOv8 training/validation/Prediction:

from ultralytics import YOLO
import cv2

# train_model() is used to train a YOLOv8 model with raw or pretrained weights (transfer learning), on the Bus-Truck dataset.
def train_model(epochs=10):
    model = YOLO("yolov8n.pt") # load pretrained weights; downloaded on first use.
    print("\nTraining started\n")
    model.train(data="data.yaml", epochs=epochs)
    #success = model.export(format="onnx") # supports other formats too (TF, torchscript, etc).
    print("\nTraining complete.\n")

# resume_training() is used to further train an already trained model, whose weights file path is given in "model_path".
def resume_training(model_path, epochs=10):
    model = YOLO(model_path) # load previously trained weights.
    print("\nTraining resumed\n")
    model.train(data="data.yaml", epochs=epochs)
    print("\nTraining complete.\n")

# perform validation.
def validate_model():
    # link: https://docs.ultralytics.com/modes/val/
    model = YOLO("runs/detect/train/weights/best.pt") # no arguments needed; the dataset and settings are remembered.
    metrics = model.val(data="data.yaml")
    #metrics.box.map # map50-95
    #metrics.box.map50 # map50
    #metrics.box.map75 # map75
    print(metrics.box.maps) # a list containing map50-95 for each category.
    print("\nValidation complete.\n")

# predict() is used to perform inference using an already trained model.
def predict(img_name):
    model = YOLO("runs/detect/train/weights/best.pt")
    # pick up images from the validation folder.
    img_path = "../dataset/yolo_data/val/images/"
    # "stream=True" returns a generator (memory efficient). The input to predict() can be an
    # image path, url, opencv|PIL image, tensor, directory, video, etc. See
    # https://docs.ultralytics.com/modes/predict/#image-formats for more info on the input &
    # results structure.
    results = model.predict(img_path + img_name)
    res_plotted = results[0].plot()

    # show the plotted (annotated) image in a window.
    cv2.imshow("window1", res_plotted)
    cv2.waitKey(0)
    print(f"results length: {len(results)}")
    # get individual outputs from result. result.orig_img = the original image loaded in memory;
    # result.orig_shape = the original image shape. "results" can have multiple entries if there
    # are multiple inputs, e.g. a video or a directory.
    for result in results:
        # "boxes" contains boxes (2D tensor of dim (N,6)), "cls" (tensor) & "conf" (tensor)
        # for all (N) objects detected in the input image. (N,6) => (4 BB values, conf, cls);
        # format = [x1, y1, x2, y2, score, label]. Also available: boxes.xywh|xyxy|xyxyn|... (n = normalized).
        boxes = result.boxes
        all_boxes = boxes.boxes # contains only the (N,6) values, no "cls" or "conf" fields.
        if all_boxes.shape[0] == 0:
            print("\nNo Detections")
        else:
            print_all_detections(all_boxes, boxes.cls, boxes.conf, result.names)
        masks = result.masks # masks.segments (bounding coordinates of masks), masks.data (raw masks tensor).
        probs = result.probs # probabilities for classification tasks.
    print("\nPrediction completed.\n")

# print_all_detections() extracts all BBs, their corresponding classes (string names) &
# confidence scores, from "all_bbs".
def print_all_detections(all_bbs, classes, confidence, names):
    for i in range(all_bbs.shape[0]):
        output_class_name = names[int(classes[i].item())]
        print(f"\nBB: {all_bbs[i]}\nClass: {output_class_name}\nConfidence: {confidence[i]}\n")

#train_model()

#resume_training("runs/detect/train/weights/best.pt", 10)

#validate_model()

#predict("00a2c3be48bcef93.jpg") # truck - 237.9 ms on cpu.


#predict("0a9f6021cf16c680.jpg") # bus - 207 ms on cpu.
#predict("0a187bec26743dd4.jpg") # truck, but not clear - model could not detect.
predict("0abd21baef23d872.jpg") # bus

Note on YOLO metrics:

mAP(B) - Mean Average Precision (BoundingBox):


mAP(B) is the standard Mean Average Precision metric calculated for object
detection tasks, focusing on bounding boxes. It measures the accuracy of the
model in localizing objects within the input image. The process involves
calculating the Average Precision (AP) for each class based on the bounding box
predictions. AP represents the area under the precision-recall curve and
quantifies the model's ability to provide accurate bounding boxes for different
object classes.

mAP(P) - Mean Average Precision (Pixel):


mAP(P) is a variant of mAP that evaluates the pixel-level accuracy of the object
detection model. Instead of evaluating the accuracy of the bounding boxes,
mAP(P) evaluates the model's ability to provide accurate object segmentation
masks (pixel-wise predictions). This metric is useful when precise object
segmentation is crucial in the application, such as in instance segmentation
tasks.

Both mAP(B) and mAP(P) provide valuable insights into the model's
performance, depending on the specific requirements of the application.
Typically, mAP(B) is more commonly used for standard object detection tasks,

while mAP(P) is used when precise pixel-level segmentation is required, as in
instance segmentation.

COCO Evaluation Metrics:

1) Object Detection / Segmentation: https://cocodataset.org/#detection-eval

2) POSE - Object Keypoint Similarity (OKS): https://cocodataset.org/#keypoints-eval
The keypoints evaluation metric is used when predicting keypoints, such as in YOLO-Pose models.
It is calculated from the distance between predicted points and ground-truth points, normalized by the scale of the person/entity, as well as other affecting conditions such as fallout (an individual-keypoint-specific measure of false positives: it quantifies the proportion of incorrectly detected keypoints or keypoint associations), occlusions, etc.
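For reference, the OKS formula defined by COCO is:

OKS = sum_i [ exp(-d_i^2 / (2 * s^2 * k_i^2)) * delta(v_i > 0) ] / sum_i [ delta(v_i > 0) ]

where d_i is the Euclidean distance between the predicted and ground-truth keypoint i, s is the object scale (square root of the object's segment area), k_i is a per-keypoint constant that controls falloff, and v_i is the ground-truth visibility flag of keypoint i.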

Resuming vs Extending Training in YOLO:

If your training was interrupted for any reason, you may continue (resume) where you left off using the --resume argument (in Python: model.train(resume=True)).
YOLO Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start, and are designed to fall to a minimum LR on the final epoch for best training results. For this reason, you cannot modify the number of epochs once training has started.

If your training fully completed, you can start a new training (extended to more epochs) from any model (checkpoint) using the --weights argument, i.e. if you would like to start training from a (previously) fully trained model, use the --weights argument, not the --resume argument:
Ex (CLI): python train.py --weights path/to/best.pt # start from a pretrained model.
In Python, the pretrained flag can be set to True (or a path to a pretrained model) to load weights from a pretrained model, as sketched below.
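A minimal sketch of both cases in Python (the checkpoint paths are placeholders):

from ultralytics import YOLO

# Resume an interrupted run: load the last checkpoint and continue its LR schedule.
model = YOLO("runs/detect/train/weights/last.pt")
model.train(resume=True)

# Extend a fully completed run: start a NEW training from the trained weights.
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="data.yaml", epochs=50)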

To customize YOLO (say model, loss function, etc), see
https://docs.ultralytics.com/usage/engine/.

To customize loss function in YOLO:

(1) Create a custom trainer class derived from a base trainer class (PoseTrainer
for PoseModel for key points) (see link:
https://docs.ultralytics.com/usage/engine/).

(2) Create a custom model class derived from a suitable Base model class
(PoseModel for key points) (see link:
https://docs.ultralytics.com/reference/nn/tasks/#ultralytics.nn.tasks.PoseModel).

(3) To customize loss function for our custom model:


(a) Create a custom loss class derived from a suitable loss class (such as
from "class v8PoseLoss(v8DetectionLoss)"), that can be used in our custom
model's init_criterion() function (see link:
https://docs.ultralytics.com/reference/utils/loss/#ultralytics.utils.loss.v8PoseLoss).
(b) Override custom model's init_criterion() function, that is defined in
ultralytics.nn.tasks.BaseModel, to set/return the (custom) loss class/function (say
derived from v8PoseLoss()).

CODE:

Customization:
The example code below illustrates how to customize the YOLO loss function for
Pose (keypoint prediction), by customizing YOLO's Trainer, inside the trainer its
Model, and finally inside the model, its Loss function.

from ultralytics.models.yolo.pose import PoseTrainer  # for deriving custom trainer.
from ultralytics.nn.tasks import PoseModel  # for deriving custom model from PoseModel.
from ultralytics.utils.loss import v8PoseLoss  # for deriving custom loss from v8PoseLoss.
from ultralytics.utils import DEFAULT_CFG, yaml_load

# This file contains definitions for custom trainer, model & loss functions in YOLO.

# Define CUSTOM LOSS class, derived from Ultralytics v8PoseLoss.
class CustomPoseLoss(v8PoseLoss):
    def __init__(self, model, loss_scaler_param):
        super().__init__(model)
        self.loss_scaler = loss_scaler_param

    def __call__(self, preds, batch):
        # return super().__call__(preds, batch)
        loss, loss_detached = super().__call__(preds, batch)
        """
        # loss is returned from the super() class as below:
        loss[0] *= self.hyp.box  # box gain
        loss[1] *= self.hyp.pose / batch_size  # pose gain
        loss[2] *= self.hyp.kobj / batch_size  # kobj gain
        loss[3] *= self.hyp.cls  # cls gain
        loss[4] *= self.hyp.dfl  # dfl gain

        return loss.sum() * batch_size, loss.detach()
        """
        # tensor shapes: loss = [1], loss_detached = [5].
        loss_detached[1] *= self.loss_scaler  # increase loss 'pose' magnitude.
        loss_detached[2] *= self.loss_scaler  # increase loss 'kobj' magnitude.
        # scale loss with pose & kobj loss.
        scaled_loss = loss + loss_detached[1] + loss_detached[2]
        return (scaled_loss, loss_detached)

# Define CUSTOM MODEL class, derived from Ultralytics PoseModel.
class CustomPoseModel(PoseModel):
    def __init__(self, loss_scaler_param=1, cfg=None, ch=3, nc=None,
                 data_kpt_shape=(None, None), verbose=True):
        super().__init__(cfg, ch, nc, data_kpt_shape, verbose)
        # set the scaler variable, which increases the magnitude of the loss by a
        # scalar value.
        self.loss_scaler = loss_scaler_param

    def init_criterion(self):
        # set your custom loss function here. 1st param is the model object (self),
        # 2nd param is the loss scaler.
        return CustomPoseLoss(self, self.loss_scaler)  # return custom loss class.

# Define CUSTOM TRAINER for pose, derived from Ultralytics PoseTrainer.
# Usage example: CustomPoseTrainer(overrides=<dict>, loss_scaler_param=10)
class CustomPoseTrainer(PoseTrainer):
    def __init__(self, cfg=DEFAULT_CFG, overrides=None, _callbacks=None,
                 loss_scaler_param=1):
        super().__init__(cfg, overrides, _callbacks)
        self.loss_scaler = loss_scaler_param
        # used in set_overrides_in_config(), called from get_model(), to override
        # the contents of DEFAULT_CFG.
        self.overrides = overrides

    # Manually override the default config (cfg) with the overrides parameter.
    # This should have happened automatically in the Ultralytics super().__init__() though.
    def set_overrides_in_config(self, cfg):
        yaml_override = yaml_load(self.overrides['data'])
        cfg['kpt_shape'] = yaml_override['kpt_shape']
        cfg['names'] = yaml_override['names']
        return cfg

    def get_model(self, cfg=None, weights=None, verbose=True):
        # return super().get_model(cfg, weights, verbose)
        cfg = self.set_overrides_in_config(cfg)
        # NOTE: pass by keyword; `weights` is not the 3rd positional parameter of
        # CustomPoseModel.__init__, so load the weights separately instead.
        model = CustomPoseModel(self.loss_scaler, cfg, verbose=verbose)
        if weights:
            model.load(weights)
        return model

Usage:
Illustrates how to use the above customized classes for training.

from yolo_custom_trainer import CustomPoseTrainer

# This file trains using the custom trainer, with a custom YOLO model (which has the
# custom loss function set internally, from the file "yolo_custom_trainer.py").

def train(e=30):
    args = dict(model='yolov8n-pose.pt', data='yolo-pose.yaml', epochs=e)
    # loss_scaler_param = magnitude to scale the loss with.
    trainer = CustomPoseTrainer(overrides=args, loss_scaler_param=10)
    print("Training Started:\n\n")
    trainer.train()
    print("Training Complete.")
    return

train(60)

# Code for resuming training.
def resume_training(e=30):
    prev_trained_weights = 'runs/pose/train/weights/last.pt'  # path to previously trained weights.
    # pretrained flag: (bool or str) whether to use a pretrained model (bool) or a
    # model to load weights from (str).
    args = dict(model=prev_trained_weights, data='yolo-pose.yaml', epochs=e,
                pretrained=prev_trained_weights)
    # loss_scaler_param = magnitude to scale the loss with.
    trainer = CustomPoseTrainer(overrides=args, loss_scaler_param=10)
    print("Extended Training Starts:\n\n")
    trainer.train()
    print("Extended Training Complete.")
    return

#resume_training(40)

To customize Preprocessing of input data:

Code:

import cv2

# Define custom preprocessing function.
def my_custom_preprocessing_function(image):
    # Preprocessing - apply desired colormap to grayscale image.
    colored_image = cv2.applyColorMap(image, cv2.COLORMAP_HOT)
    return colored_image

# Pass custom preprocessing function to create_dataloader.
from utils.dataloaders import create_dataloader

train_dataloader = create_dataloader(train_path, imgsz=imgsz, batch_size=batch_size,
                                     quad=False, rect=False, cache=False,
                                     single_cls=False, augment=False, workers=workers,
                                     pad=0.0, preproc_fn=my_custom_preprocessing_function)

Once you have created your custom dataloader using the create_dataloader
function and defined your custom preprocessing function, you can pass it directly
to your train function.

Instead of passing the .yaml file as the data argument, you can pass the
train_dataloader you created as the first argument. You can also remove the
data and batch_size arguments, because they are taken care of by the
train_dataloader you created.

Code:

model = YOLO(...)
model.train(train_dataloader, imgsz=imgsz, epochs=20)

The train function will automatically use the dataloader you pass as the first
argument.

SSD (Single Shot Detector):

In a network, different layers have different receptive fields to the original image.
For example, the initial layers have a smaller receptive field when compared to
the final layers, which have a larger receptive field.
SSD leverages this phenomenon to predict classes & BBs:
(a) We leverage the pre-trained (say VGG) network and extend it with a few
additional layers until we obtain a 1 x 1 block.
(b) Instead of leveraging only the final layer for bounding box and class
predictions, we will leverage all of the last few layers to make class and
bounding box predictions (difference between SSD & other detectors like YOLO).
(c) In place of anchor boxes, we will come up with default boxes that have a
specific set of scale and aspect ratios. Each of the default boxes should predict
the object and bounding box offset just like how anchor boxes are expected to
predict classes and offsets.

SSD doesn’t use k-means to find the anchors. Instead it uses a mathematical
formula to compute the anchor sizes. Therefore, SSD’s anchors are
independent of the dataset (hence named default/prior boxes).
Another small difference: YOLO’s (initial versions) anchors are just a width and
height, but SSD’s anchors also have an x,y-position. YOLO simply assumes that
the anchor’s position is always in the center of the grid cell (for SSD this is also
the default thing to do).
Thanks to anchors, the detectors don’t have to work very hard to make
pretty good predictions already, because predicting all zeros simply outputs the
anchor box, which will be reasonably close to the true object (on average). This
makes training a lot easier. Without the anchors, each detector would have to
learn from scratch what the different bounding box shapes look like… a much
harder task.

SSD Network Architecture.

Model Summary:

from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)
print(model)

Output:

SSD(
(backbone): SSDFeatureExtractorVGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=True)
(17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): ReLU(inplace=True)
(19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
)
(extra): ModuleList(
(0): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): ReLU(inplace=True)
(3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): ReLU(inplace=True)
(5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Sequential(
(0): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
(1): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6))
(2): ReLU(inplace=True)
(3): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1))
(4): ReLU(inplace=True)
)
)
(1): Sequential(
(0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(3): ReLU(inplace=True)
)
(2): Sequential(
(0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(3): ReLU(inplace=True)
)
(3): Sequential(
(0): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1))
(3): ReLU(inplace=True)
)
(4): Sequential(
(0): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1))
(3): ReLU(inplace=True)
)
)
)
(anchor_generator): DefaultBoxGenerator(aspect_ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]],
clip=True, scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05], steps=[8, 16, 32, 64, 100, 300])
(head): SSDHead(
(classification_head): SSDClassificationHead(
(module_list): ModuleList(
(0): Conv2d(512, 364, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2d(1024, 546, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(512, 546, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2d(256, 546, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2d(256, 364, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Conv2d(256, 364, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
)
(regression_head): SSDRegressionHead(
(module_list): ModuleList(
(0): Conv2d(512, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2d(1024, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(512, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2d(256, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
)
)
(transform): GeneralizedRCNNTransform(
Normalize(mean=[0.48235, 0.45882, 0.40784], std=[0.00392156862745098,
0.00392156862745098, 0.00392156862745098])
Resize(min_size=(300,), max_size=300, mode='bilinear')
)
)

The DefaultBoxGenerator class is responsible for generating the default boxes of
SSD and operates similarly to the AnchorGenerator of FasterRCNN. It produces
a set of predefined boxes of specific width and height which are tiled across the
image and serve as the first rough prior guesses of where objects might be
located.

The SSDHead class is responsible for initializing the Classification and
Regression parts of the network.
- Both the Classification and the Regression head inherit from the same
class, which is responsible for making the predictions for each feature map.
- Each level of the feature map uses a separate 3x3 convolution to estimate
the class logits and box locations.
- The number of predictions that each head makes per level depends on the
number of default boxes and the sizes of the feature maps.

Inputs:

The input to the model is expected to be a list of tensors, each of shape
[C, H, W], one per image. Image values should be in the 0-1 range.
The behaviour of the model changes depending if it is in training or
evaluation mode.

Training:

During training, the model expects both the input (image) tensors (1st arg) and
a targets argument (a list of dictionaries) containing:
- boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format,
with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H, i.e. {top-left=(x1,y1), bottom-right=(x2,y2)}.
- labels (Int64Tensor[N]): the class label for each ground-truth box.
Ex: [{"boxes": <Tensor.shape(N,4)>, "labels": <Tensor.shape(N)>},
{"boxes": <Tensor.shape(N,4)>, "labels": <Tensor.shape(N)>}, ...]
The length of this list is equal to the number of images in the input (1st arg).

The model returns a Dict[Tensor] during training, containing the classification and
regression losses.

Inference:

During inference, the model requires only the input tensors, and returns the post-
processed predictions as a List[Dict[Tensor]], one for each input image. The
fields of the Dict are as follows, where N is the number of detections:

- boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format,
with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
- labels (Int64Tensor[N]): the predicted labels for each detection.
- scores (Tensor[N]): the scores for each detection.
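
A minimal sketch of both calling modes on dummy data (the boxes and labels are illustrative):

import torch
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)

images = [torch.rand(3, 300, 300)]  # one image, values in the 0-1 range.
targets = [{"boxes": torch.tensor([[20., 30., 120., 200.]]),
            "labels": torch.tensor([5])}]

model.train()
loss_dict = model(images, targets)  # dict of classification & box regression losses.

model.eval()
with torch.no_grad():
    detections = model(images)  # [{'boxes': ..., 'scores': ..., 'labels': ...}]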

NOTE: In general, YOLO is usually used when accuracy is more important, while SSD
is used when speed (a higher frame rate) is more desired, trading off some
accuracy.

DETR (Detection Transformer):

DETR is a technique that leverages transformers to come up with an end-to-end
pipeline that simplifies the object detection network architecture considerably.
Transformers are one of the more popular and more recent techniques used to
perform various tasks in NLP.

The authors view object detection as a set prediction problem (with absolute
box prediction w.r.t. the input image rather than an anchor). A set prediction
problem is when you try to guess a group of items based on some information.
Thinking of object detection as a set prediction problem has some
challenges. The biggest one is getting rid of duplicate predictions.
By treating object detection as a set prediction problem, the need for
manually designed components previously required in object detection
tasks to incorporate prior knowledge is eliminated, like spatial anchors or
non-maximal suppression. This approach simplifies the process and
streamlines the task.

DETR predicts all objects at once, and is trained end-to-end with a set loss
function which performs bipartite matching between predicted and ground-truth
objects.

DETR can also be used for performing Segmentation.

Paper name: End-to-End Object Detection with Transformers (by Meta AI).

Transformer:

Transformers have proven to be a remarkable architecture for sequence-to-
sequence problems. This class of networks uses only linear layers and softmax
to create self-attention. Self-attention helps in identifying the
interdependency among words in the input text. The input sequence typically
does not exceed 2,048 items as this is large enough for text applications.
However, if images are to be used with transformers, they have to be flattened,
which creates a sequence on the order of hundreds of thousands of elements (a
300 x 300 x 3 image contains 270K values), which is not feasible. Facebook
Research came up with a novel way to bypass this restriction by giving the
feature map (which has a smaller size than the input image) as input to the
transformer.

Multihead Attention.

At the heart of a transformer is the self-attention module. It takes three
two-dimensional matrices (called the query (Q), key (K), and value (V) matrices)
as input.
The "key" is a linear projection of the input, i.e. the input multiplied by some
weights; so is the "value". The "query" is the entity (say, the target) that you
are going to attend to (i.e. pay attention to) in the input sequence.
The reason for using two matrices, 'key' & 'value', derived from the input
is so that during learning, the gradient updates (from the dot-product and
attention-weight-multiplication modules, respectively) during backpropagation do
not conflict with each other (see step 2 in the figure Multihead Attention).

Each K vector comes paired with a V value. The greater the compatibility
of a given K with Q, the greater influence the concerned V will have on the output
of the attention mechanism.
These Q, K & V are learnt by the model during training.
K & V come from encoder outputs that are fed to the decoder; while Q
comes from the decoder’s masked MHA (MMHA), which is propagated further in
the decoder.

WORKING:

In a hypothetical scenario where the sequence length is 3, we have three word
embeddings (W1, W2, and W3) as input. Say each embedding is of size 512.
Each of these embeddings is individually converted into three additional vectors,
which are the query, key, and value vectors corresponding to each input.

Since each vector is 512 in size, it is computationally expensive to do a matrix
multiplication between them. So, we split each of these vectors into eight parts
(heads), giving eight sets of (64 x 3) vectors for each of the key, query, and
value tensors, where 64 is obtained from 512 (embedding size) / 8 (multi-heads)
and 3 is the sequence length.
Multi-head attention is a simple optimization that seems to be able to
increase the performance of a model without a significant increase in
computational cost.
Also, using multiple heads enables the transformer to learn different
representations simultaneously. The multiple heads are created along the
embedding dimension, not the input sequence dimension. So, each head sees all
the words in the sequence, but only a part of their embeddings - it sees the
full sentence, but a different aspect of each word, i.e. each head will try to
relate all the words with each other using different aspects of the same word.

Note that there will be eight sets of tensors of Kw11, Kw12, and so on because
there are eight multi-heads.

In each part, we first perform matrix multiplication between the key and query
matrices. This gives a result that indicates what in the input (key) is relevant, with
respect to (or from the point of view of) the query.
This way, we end up with a 3 x 3 matrix. Pass it through softmax activation.
Now, we have a matrix showing how important each word is, in relation to every
other word, as high probabilities. Elements with low probabilities indicate words
that are not important/relevant (with respect to the query) - also called “masking
out”.
Dot products are used to find the cosine similarity between two
vectors. The dot-product between tensors simply expands this to find many
different cosine similarities between the different vectors inside the tensors.

Finally, we perform matrix multiplication of the preceding tensor output with the
value tensor to get the output of our self-attention operation.
So, the output that we get here (attention) contains the information of:
(a) original word embeddings for all words in the input sentence.
(b) positional encoding.
(c) interaction of each word with all other words in the input.

We then combine the eight outputs of this step using a concat layer (step 3 in
the Multihead Attention figure), and end up with a single tensor of size 512 x 3.

Because of the splitting of the Q, K, and V matrices, the layer is also called multi-
head self-attention.
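
A minimal sketch of one head's computation described above (random tensors stand in for the projected inputs):

import torch
import torch.nn.functional as F

seq_len, d_head = 3, 64                 # 64 = 512 (embedding size) / 8 (heads).
Q = torch.rand(seq_len, d_head)
K = torch.rand(seq_len, d_head)
V = torch.rand(seq_len, d_head)

scores = Q @ K.T / (d_head ** 0.5)      # (3, 3): relevance of each key to each query.
weights = F.softmax(scores, dim=-1)     # rows sum to 1; low values are "masked out".
attention = weights @ V                 # (3, 64): weighted mix of the values.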

The idea behind such a complex-looking network is as follows:
- Values (Vs) are the processed embeddings that need to be learned for a
given input, in the context of the key and query matrices.
- Queries (Qs) and Keys (Ks) act in such a way that their combination will
create the right mask so that only the important parts of the value matrix are
fed to the next layer.

For our example in computer vision, when searching for an object such as a
horse, the query should contain information to search for an object that is large in
dimension and is brown, black, or white in general. The softmax output of scaled
dot-product attention will reflect those parts of the key matrix that contain this
colour (brown, black, white, and so on) in the image. Thus, the values output
from the self-attention layer will have those parts of the image that are roughly of
the desired colour and are present in the values matrix.
We use the self-attention block several times in the network.

Sample code to build a Transformer block in Pytorch:

from torch import nn

# hidden_dim is the size of the embeddings, nheads is the number of heads in the
# multi-head self-attention, and num_encoder_layers and num_decoder_layers are the
# number of encoding and decoding blocks in the network, respectively.
transformer = nn.Transformer(hidden_dim, nheads, num_encoder_layers,
                             num_decoder_layers)

Working of DETR:

The matching loss function uniquely assigns a prediction to a ground-truth
object, and is invariant to a permutation of the predicted objects (i.e. their
order), so we can emit them in parallel. The loss produces an optimal bipartite
matching between predicted and ground-truth objects, and then optimizes
object-specific (bounding box) losses.
DETR demonstrates significantly better performance on large objects, a
result likely enabled by the non-local computations of the transformer. It obtains,
however, lower performances on small objects.

There are few key differences between a normal transformer network and DETR:
1) Our input is an image, not a sequence. So, DETR passes the image
through a ResNet backbone to get a feature map with 256 channels, which can then
be treated as a sequence (of tokens).
2) The inputs to the decoder are object-query embeddings, which are
automatically learned during training. These act as the query matrices for all the
decoder layers.
3) This enriched feature map of the image is given to a transformer encoder-
decoder, which outputs the set of box predictions, through Feed Forward
Network (FFN) i.e. prediction heads. Each of these boxes consists of a tuple. The
tuple will be a class and a bounding box.
Note: This also includes the class NULL or Nothing class as
well (background class).
4) Unlike the original transformer, which outputs the prediction sequence in time
steps i.e. one element at a time (during inference), our transformer outputs all
the N predictions (labels & their bounding boxes) in parallel, at the same time,
at each decoder layer.

Now, this is a real problem, as in the annotations there is no object class
labeled as "nothing". Comparing and dealing with similar objects next to each
other is another major issue, and in this paper it is tackled by using a
bipartite matching loss (bipartite matching for uniquely assigning predictions
to ground truth, then using the Hungarian loss to calculate the loss - which
includes the matching loss & the bounding box loss).

The loss is computed by comparing each predicted class and bounding box
(including the "none" class), of which there are, say, N, against the
annotations, padded with "nothing" entries to bring the total number of boxes to
N. The assignment of predictions to ground truth is one-to-one, such that the
total loss is minimized. There is a very famous algorithm, called the Hungarian
method, to compute this minimum-cost matching.
i.e. With the model always outputting a fixed number of object labels, many of
the object labels need to be classified as “No Object”. Additionally, these objects
could be in any order because any of the object queries could detect a specific
object. Therefore, when training the model, the labeled data is padded with “No
Object” labels to match with the size of the model’s output. A bipartite matching
algorithm is then used to match each object with their respective true value.

The matching algorithm completes the matching by minimizing the loss
due to mismatched or poorly localized objects. If there is a mismatch in objects,
the algorithm is forced to match predictions that are not the same as the label,
penalizing the model loss greatly. Extra object predictions will be forced to match
with “No Object” labels. The model, therefore, directly learns to predict the
correct number of objects, so there is no need to use non-max suppression.

Because the model creates a fixed number of object labels, there is a hard limit
on how many distinct objects DETR can detect. As you can imagine, the more
objects there are to detect, the more likely an object will be missed. If it is
expected that there will be too many object instances, the number of object
queries can be increased at the cost of extra computational time.
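
The one-to-one assignment itself can be sketched with SciPy's implementation of the Hungarian algorithm (the cost matrix below is an illustrative assumption; DETR builds it from classification & box losses):

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = matching loss of prediction i against (padded) target j.
cost = np.array([[0.2, 0.9, 0.8],
                 [0.7, 0.1, 0.9],
                 [0.8, 0.9, 0.3]])
pred_idx, tgt_idx = linear_sum_assignment(cost)  # one-to-one, minimum total cost.
print(list(zip(pred_idx, tgt_idx)))              # [(0, 0), (1, 1), (2, 2)]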

DETR Architecture.
From Paper "End-to-End Object Detection with Transformers":

Transformer encoder:
The encoder expects a sequence as input, hence we collapse the
spatial dimensions of z0 into one dimension, resulting in a d×HW feature map.
Each encoder layer has a standard architecture and consists of a multi-head
self-attention module and a feed forward network (FFN). Since the transformer
architecture is permutation-invariant, we supplement it with fixed positional
encodings that are added to the input of each attention layer.

Transformer decoder:
The decoder follows the standard architecture of the transformer,
transforming N (N=100 in paper "End-to-End Object Detection with
Transformers") embeddings of size d using multi-headed self- and encoder-
decoder attention mechanisms. The difference with the original transformer is
that our model decodes the N objects in parallel at each decoder layer.
Since the decoder is also permutation-invariant, the N input embeddings
must be different to produce different results. These input embeddings are
LEARNT(*) positional encodings that we refer to as object queries, and similarly
to the encoder, we add them to the input of each attention layer. The N object
queries are transformed into an output embedding by the decoder.
They are then independently decoded into box coordinates and class
labels by a feed forward network (FFN), resulting in N final predictions. Using self
and encoder-decoder attention over these embeddings, the model globally
reasons about all objects together using pairwise relations between them, while
being able to use the whole image as context.

(*): The LEARNT positional encodings/object queries (weights) that are used as
input to the decoder (in addition to the encoder input - the input image) make
the decoder output "output embeddings", which help in recognising the objects &
their Bounding Boxes (BBs) in the input image during inference (via the FFN).
These weights are learnt during training, such that they correctly classify &
localize the objects wrt their ground truth values in the training dataset. During
training, the weights are adjusted such that the model's output (via FFNs) is
sufficiently close to the ground truth's objectness, class type & BBs - when the
provided input image contains the target objects.

FFN: The final prediction is computed by a 3-layer perceptron with
ReLU activation function and hidden dimension d, and a linear projection layer.
The FFN predicts the normalized center coordinates, height and width of the box
w.r.t. the input image, and the linear layer predicts the class label using a
softmax function.
Since we predict a fixed-size set of N bounding boxes, where N is usually much
larger than the actual number of objects of interest in an image, an additional special
class label ∅ is used to represent that no object is detected within a slot. This class plays
a similar role to the “background” class in the standard object detection approaches.

The encoder is important for disentangling objects. The encoder seems to
separate instances already, which likely simplifies object extraction and
localization for the decoder. The more encoder layers, the better.
The positional embeddings in the encoder help with encoding information
about the position of various objects in the image.

Importance of positional encodings: There are two kinds of positional encodings
in our model: (1) spatial positional encodings (of image features, in sinusoidal
form) and (2) output positional encodings (object queries).

The inputs to the decoder are object-query embeddings (used for conditioning
information - each of these object queries can be thought of as a single question
on whether an object is in a certain region; this also means the number of object
queries caps how many objects the model can detect), which are automatically
learned during training. These act as the query matrices for all the decoder
layers. Similarly, for every layer, the key and value matrices are going to be
the final output matrix from the encoder block, replicated twice.
In the decoder, you can find attention layers where the query, key, and
values aren’t taken from the same source, but rather only the key and value
tensors are taken from the encoder whereas the query tensor is taken from the
previous decoder layer. The attention layer otherwise works exactly the same
way, only with that slight modification in inputs. This integrates the information
from the encoder into the decoder, i.e. it asks the question “how do the words
(classes) in the input word sequence relate to the words (classes) that have been
outputted so far”.

For the decoder in DETR, it does not make sense to start off its input with a start
token. In fact, object detection does not need to detect objects in sequence, so
there is no need to continuously concatenate the output to the next input in the
decoder.
Instead, a fixed number of trainable inputs (in DETR, 100) are used for the
decoder called object queries.
Object queries are learnt positional encodings. The N object queries are
transformed into an output embedding by the decoder. They are then
independently decoded into box coordinates and class labels by a feed forward
network (FFN), resulting in N final predictions.
Using self- and encoder-decoder attention over these embeddings, the
model globally reasons about all objects together using pairwise relations
between them, while being able to use the whole image as a context.

Training settings for DETR differ from standard object detectors in multiple ways.
The new model requires an extra-long training schedule and benefits from
auxiliary decoding losses in the transformer.

Sample code for illustrating DETR Model:

from collections import OrderedDict
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        self.backbone = resnet50()
        layers = OrderedDict()
        for name, module in self.backbone.named_modules():
            if name in ['conv1', 'bn1', 'relu', 'maxpool',
                        'layer1', 'layer2', 'layer3', 'layer4']:
                layers[name] = module
        self.backbone = nn.Sequential(layers)
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

In the preceding code, we are specifying the following:
- The layers of interest, in sequential order (self.backbone)
- The convolution operation (self.conv)
- The transformer block (self.transformer)
- The final connection to obtain the number of classes (self.linear_class)
- The bounding box head (self.linear_bbox)

Define the positional embeddings for the encoder and decoder layers:

self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))  # 100 output embeddings, each of size "hidden_dim".
self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

self.query_pos is the positional embedding input for the decoder layer, whereas
self.row_embed and self.col_embed form the two-dimensional positional
embeddings for the encoder layer.

Define the forward method:

def forward(self, inputs):
    x = self.backbone(inputs)
    h = self.conv(x)
    H, W = h.shape[-2:]
    # Rearrange the positional embedding vectors for the encoder layer.
    pos = torch.cat([self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
                     self.row_embed[:H].unsqueeze(1).repeat(1, W, 1)],
                    dim=-1).flatten(0, 1).unsqueeze(1)
    # Finally, predict on the feature map obtained from resnet using the
    # transformer network.
    h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
                         self.query_pos.unsqueeze(1)).transpose(0, 1)
    # Post-process the output `h` to obtain class probabilities and bounding boxes.
    return {'pred_logits': self.linear_class(h),
            'pred_boxes': self.linear_bbox(h).sigmoid()}
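
A minimal usage sketch of the class above on a dummy image (num_classes=91 is an illustrative assumption):

detr = DETR(num_classes=91)
detr.eval()
with torch.no_grad():
    out = detr(torch.rand(1, 3, 800, 800))
print(out['pred_logits'].shape)  # (1, 100, 92): 100 queries, 91 classes + "no object".
print(out['pred_boxes'].shape)   # (1, 100, 4): normalized box predictions.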

ULTRALYTICS RT-DETR:

Ultralytics provides a version of DETR by the name RT-DETR (Realtime DETR) that
provides higher accuracy & speed than DETR (& YOLO), using IoU-aware object
queries:
https://docs.ultralytics.com/models/rtdetr/
Paper: https://arxiv.org/abs/2304.08069
Model YAML: https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/rt-detr/rt-detr-l.yaml (for the RT-DETR-Large model)

NOTE: Ultralytics RT-DETR requires PyTorch version 2.0 or higher.

Code:

import os
from ultralytics import RTDETR

# train_model() is used to train a YOLO RT-DETR model with raw or pretrained
# weights (transfer learning), on the bus-truck dataset.
def train_model(epochs=10):
    model = RTDETR("rtdetr-l.pt")  # load pretrained weights; will download the first time.
    # Display model information (optional)
    #model.info()
    print("\nTraining started\n")
    # set with an absolute path, as YOLO seems to cache previous paths.
    yaml_path = os.path.join(os.getcwd(), "data.yaml")
    # set flag {val=False} to not perform validation during training.
    model.train(data=yaml_path, epochs=epochs, amp=False)
    print("\nTraining complete.\n")

Prediction/Inference code is the same as that in YOLO object detection code.

KEYPOINT DETECTION:

Keypoint detection, also known as keypoint localization or landmark detection, is


a computer vision task that involves identifying and localizing specific points of
interest in an image. Ex: Keypoints represent human body joints, facial
landmarks, or salient points on objects.
Keypoint detection provides essential information about the location, pose, and
structure of objects or entities within an image, playing a critical role in computer
vision applications such as: Pose estimation, Object detection and tracking,
Facial analysis, Augmented reality, etc.

(1) Code for Pose/Keypoint Detection using ViT as backbone:

This project uses ViT as a backbone, with a custom head for detecting 36
keypoints (36*2 (for x,y) = 72 outputs), for left-ventricle wall detection in
echocardiogram frames.

Model:

# ViT & ViTPose pytorch links:
# https://github.com/lucidrains/vit-pytorch
# from Pytorch: https://pytorch.org/vision/main/models/vision_transformer.html
# https://github.com/gpastal24/ViTPose-Pytorch

import torch
import torchvision
import torch.nn as nn
from torchvision.models.vision_transformer import vit_b_16

class ViTPose(nn.Module):
    def __init__(self, num_kp: int, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.model_name = "ViTPose.pth"
        # use vit_b_16 as backbone, with the most up-to-date pretrained weights.
        self.backbone = vit_b_16(weights=torchvision.models.ViT_B_16_Weights.DEFAULT)
        # freeze weights.
        for params in self.backbone.parameters():
            params.requires_grad = False
        # use "num_kp" as number of keypoints.
        # (1) try changing the default head itself.
        # self.backbone.heads = nn.Linear(in_features=768, out_features=num_kp, bias=True)
        # (2) try adding a new keypoint head.
        self.backbone.heads = nn.Sequential(
            nn.Linear(in_features=768, out_features=400, bias=True),
            nn.BatchNorm1d(num_features=400), nn.Dropout1d(0.4), nn.GELU(),
            nn.Linear(in_features=400, out_features=num_kp, bias=True), nn.GELU())
        # (3) 3-layer head alternatives:
        # self.backbone.heads = nn.Sequential(nn.Linear(in_features=768, out_features=2048, bias=True), nn.BatchNorm1d(num_features=2048), nn.GELU(), nn.Linear(in_features=2048, out_features=1024, bias=True), nn.BatchNorm1d(num_features=1024), nn.GELU(), nn.Linear(in_features=1024, out_features=num_kp, bias=True), nn.ReLU())
        # self.backbone.heads = nn.Sequential(nn.Linear(in_features=768, out_features=400, bias=True), nn.BatchNorm1d(num_features=400), nn.Dropout1d(0.4), nn.GELU(), nn.Linear(in_features=400, out_features=400, bias=True), nn.BatchNorm1d(num_features=400), nn.Dropout1d(0.6), nn.GELU(), nn.Linear(in_features=400, out_features=num_kp, bias=True), nn.ReLU())

    def forward(self, input):
        output = self.backbone(input)
        # output = self.keypoints_head(output)
        return output

Dataset (synthetic camus):

import torch
import torchvision
import glob
import cv2
from torch.utils.data import Dataset
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

263
# this dataset is for synthetic camus dataset files, that have been stored in YOLO
format (.jpgs & corresponding annotation .txt files).
class CamusDataset(Dataset):
# the preprocessing as explained at:
https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html#torc
hvision.models.vit_b_16
preprocess = transforms.Compose([transforms.Resize(240,
interpolation=transforms.InterpolationMode.BILINEAR), transforms.CenterCrop(224),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

def __init__(self, path) -> None:


super().__init__()
self.path = path
self.files = glob.glob(path + '*.jpg')
#self.preprocess =
torchvision.models.ViT_B_16_Weights.IMAGENET1K_V1.transforms
return

def __len__(self):
return len(self.files)

# return the input image & it's kp data, at 'index'.


def __getitem__(self, index):
input_filename = self.files[index] # get data filename.
# load and prepare input image in (C,H,W) format.
in_image = torch.tensor(cv2.imread(input_filename), dtype=torch.float)
in_image = (in_image/255).permute(2,0,1).to(device)
in_image = CamusDataset.preprocess(in_image)
#in_image = in_image.to(device)
gt_filename = input_filename.replace(".jpg", ".txt") # get annotation file
name.
# load & prepare keypoint data.
data = None
with open(gt_filename) as f:
data = f.read()
data = data.split(" ")
data = data[5:] # get only keypoint data.
# load kp data in tensor.
data = list(map(float, data))
kp_data = torch.tensor(data, dtype=torch.float, device=device)
return in_image, kp_data

Train-Validate:

import torch
import torch.nn as nn
import cv2

# performs training
def train(epochs, dTrainLoader, dValidateLoader, model, lossFn, optimFn, scheduler):
    iters = len(dTrainLoader)
    for epoch in range(epochs):
        model.train()
        train_epoch_loss = 0
        loss_scaler = 1  # to scale the loss magnitude, for heavy penalty.
        for i, (input_images, kp_gt) in enumerate(dTrainLoader):
            optimFn.zero_grad()
            prediction_kp = model(input_images)  # perform inference.
            # compute loss.
            kp_loss = lossFn(prediction_kp, kp_gt) * loss_scaler
            kp_loss.backward()
            optimFn.step()
            train_epoch_loss += kp_loss.item()  # add to overall loss.
        # use scheduler step (once per epoch, matching LinearLR's total_iters=epochs).
        if scheduler is not None:
            scheduler.step()
            #scheduler.step(epoch + i/iters)
        # perform validation after every epoch.
        total_val_loss = validate(dValidateLoader, model, lossFn)
        # print average training loss.
        print(f"Epoch {epoch+1}: Training Loss: {train_epoch_loss:0.3f} , Validation Loss: {total_val_loss:0.3f}")
    return

def validate(dValidateLoader, model, lossFn):
    model.eval()
    total_val_loss = 0
    with torch.no_grad():  # no gradients needed during validation.
        for (input_images, kp_gt) in dValidateLoader:
            predictions = model(input_images)
            kp_loss = lossFn(predictions, kp_gt)
            total_val_loss += kp_loss.item()
    #print(f"Validation loss: {total_val_loss}.")
    return total_val_loss

device = "cuda" if torch.cuda.is_available() else "cpu"

# visualizes predictions in blue.


def predict(image_name, model, preprocess):
# load image & preprocess it.
image = cv2.imread(image_name)
in_image = torch.tensor(image/255, dtype=torch.float).permute(2,0,1).to(device)
in_image = preprocess(in_image).unsqueeze(dim=0)
# perform inference.
pred_kps = model(in_image)
# plot the kps.
pred_kps = pred_kps.view(-1, 2) # reshape from [72,1] to [36,2] for easier
accessibility.
img_scaler = [image.shape[1], image.shape[0]]
pred_kps *= torch.tensor(img_scaler, dtype=torch.int)
pred_kps = pred_kps.to(dtype=torch.int)
for (x,y) in pred_kps:
x = x.item()
y = y.item()
image = cv2.circle(image, (x,y), 2, (255,0,0), 2)
# visualize GT.
image = plot_gt(image_name, image)
# display image
cv2.imshow("w", image)
cv2.waitKey()
return

# visualizes GT in green.
def plot_gt(filename, image):
data = None
filename = filename.replace(".jpg", ".txt")
with open(filename) as f:
data = f.read()
data = data.split(" ")
data = data[5:] # get only keypoint data.
data = list(map(float, data))
# process GT wrt image dimensions.

266
gt_data = torch.tensor(data, dtype=torch.float).to(device)
gt_data = gt_data.view(-1, 2)
img_scaler = torch.tensor([image.shape[1], image.shape[0]], dtype=torch.int)
gt_data *= img_scaler
gt_data = gt_data.to(dtype=torch.int)
for (x,y) in gt_data:
x = x.item()
y = y.item()
image = cv2.circle(image, (x,y), 2, (0,255,0), 2)
return image

Main:

import torch
import torch.nn as nn
import torch.optim as optim
from train_validate import train, predict
from ViTPose import ViTPose
from CamusDataset import CamusDataset
from torch.utils.data import DataLoader
import os, glob

device = "cuda" if torch.cuda.is_available() else "cpu"

# path to save model to.
save_model_path = os.getcwd() + "/saved_models/"

def run_training(path, epochs=10):
    train_path = path  # + "train/"
    val_path = path  # + "val/"
    train_dataset = CamusDataset(train_path)
    val_dataset = CamusDataset(val_path)
    batch_s = 10
    dTrainLoader = DataLoader(train_dataset, batch_size=batch_s, shuffle=True, drop_last=True)
    dValidateLoader = DataLoader(val_dataset, batch_size=batch_s, shuffle=True, drop_last=True)
    model = ViTPose(72).to(device=device)  # create the model with 36 keypoints (36*2=72 i.e. (x,y) info).
    # create loss & optim functions.
    lossFn = nn.MSELoss()
    #lossFn = nn.L1Loss()
    # optimizer with L2 regularization (weight_decay).
    optimFn = optim.Adam(model.parameters(), lr=0.0005, weight_decay=1e-4)
    #no_of_batches_per_epoch = len(train_dataset) / batch_s
    scheduler = torch.optim.lr_scheduler.LinearLR(optimFn, start_factor=1, end_factor=0.001, total_iters=epochs)
    #scheduler = None #torch.optim.lr_scheduler.CyclicLR(optimFn, 0.00005, 0.0005, step_size_up=no_of_batches_per_epoch/4, step_size_down=no_of_batches_per_epoch*4, mode='exp_range', gamma=0.8, cycle_momentum=False)
    #scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimFn, T_0=10, T_mult=1, eta_min=0.00005, last_epoch=-1, verbose=False)

    try:
        train(epochs, dTrainLoader, dValidateLoader, model, lossFn, optimFn, scheduler)
    except KeyboardInterrupt:  # if a "Ctrl+C" interrupt occurs, save the model.
        torch.save(model.state_dict(), save_model_path + model.model_name)
        print(f"Model saved to {model.model_name}")
    else:
        # if no Ctrl+C was pressed, declare training complete.
        print("------------Training Complete-----------------")
        torch.save(model.state_dict(), save_model_path + model.model_name)
        print(f"Model saved to {model.model_name}")
    print("\n")
    return

#run_training("/Users/abhijitpoojary/Desktop/pixelbin_projects/datasets/synthetic-camus-val/", 60)

def perform_prediction(image_name, model=None):
    if model is None:
        # create model & load weights.
        model: ViTPose = ViTPose(72).to(device)
        result = model.load_state_dict(torch.load(save_model_path + model.model_name,
                                                  map_location=torch.device('cpu')))
        model.eval()
    predict(image_name, model, CamusDataset.preprocess)
    return

(C) SEGMENTATION:

Image segmentation involves converting an image into a collection of regions of
pixels that are represented by a mask or a labeled image. By dividing an image
into segments, you can process only the important segments of the image instead
of processing the entire image.
Semantic (or class) segmentation cannot distinguish between different instances
of the same category, i.e. all chairs are marked blue. Instance segmentation can
distinguish between different instances of the same category, i.e. different
chairs are distinguished by different colours.
Panoptic segmentation is a task that combines both semantic & instance
segmentation.

Architectures such as U-Net (for semantic) & Mask R-CNN (for instance) can be
used for image segmentation.

Usually in object detection, while predicting the class of an object and the
bounding box corresponding to the object, we pass the image through a network,
flatten the output at a certain layer, and connect additional dense layers before
making predictions for the class and bounding box offsets.
However, in the case of image segmentation, where the output shape is
the same as that of the input image's shape, flattening the convolutions' outputs
and then reconstructing the image might result in a loss of information.
Furthermore, the contours and shapes present in the original image will not
vary in the output image in the case of image segmentation.
The two aspects that we need to keep in mind while performing
segmentation are as follows:
- The shape and structure of the objects in the original image remain the
same in the segmented output.
- Leveraging a fully convolutional architecture (and not a structure where we
flatten a certain layer) can help here since we are using one image as input and
another as output.

Masks: Mask data comes in two formats:
(a) Binary masks (0s & 1s), specified as a logical array of size { H x W x
NumObjects } (1s in each channel indicate the presence of one of the "NumObjects"
labels). Each mask is the segmentation of one instance in the image. Used in
Mask R-CNN.
(b) Polygon coordinates. Each row of the array contains the (x,y) coordinates
of a polygon along the boundary of one instance in the image. Used in YOLOv8
segmentation.

SEMANTIC SEGMENTATION:

UNet:

UNet, evolved from the traditional convolutional neural network, was first
designed and applied in 2015 to process biomedical images. The reason it is
able to localize and distinguish borders is by doing classification on every pixel,
so the input and output share the same size.

UNet Architecture.

In the left half (Encoder) of the preceding diagram, we can see that the image
passes through convolution layers, and that the image size keeps reducing while
the number of channels keeps increasing (to capture different details or features
in our image).
However, in the right half (Decoder), we can see that we are upscaling the
downscaled image, back to the original height and width but with as many
channels as there are classes.
In addition, while upscaling in the right half, we are also leveraging information
from the corresponding layers in the left half using skip connections so that we
can preserve the structure/objects in the original image.
NOTE: The difference between residual connections & U-Net skip connections is
that in residual connections, we add the previous inputs, whereas in U-Net, we
concatenate the previous input signals along the channel dimension.
This way, the U-Net architecture learns to preserve the structure (and
shapes of objects) of the original image while leveraging the convolution's
features to predict the classes that correspond to each pixel. In general, we have
as many channels in the output as the number of classes we want to predict.
The decoder needs a sufficient number of convolutions to process the feature
maps generated by the encoder and reconstruct the image/mask.

In the U-Net architecture, upscaling is performed using the nn.ConvTranspose2d
method (which performs transposed convolution), which takes the number of input
channels, the number of output channels, the kernel size, and the stride as
input parameters.
ConvTranspose2d input/output shape calculation:
https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html
Inputs: N, Cin, Hin, Win.
Outputs: N, Cout, Hout, Wout.

Hout = (Hin - 1) * stride[0] - 2*padding[0] + dilation[0] * (kernel_size[0] - 1) + output_padding[0] + 1
Wout = (Win - 1) * stride[1] - 2*padding[1] + dilation[1] * (kernel_size[1] - 1) + output_padding[1] + 1

Defining a loss function:

The most commonly used loss function for the task of image segmentation is a
pixel-wise cross entropy loss. This loss examines each pixel individually,
comparing the class predictions (depth-wise pixel vector) to our one-hot encoded
target vector.

U-net uses a loss function for each pixel of the image. This helps in easy
identification of individual cells within the segmentation map. Softmax is applied
to each pixel, followed by a loss function.
This converts the segmentation problem into a classification problem where we
need to classify each pixel to one of the classes.

Because the cross entropy loss evaluates the class predictions for each
pixel vector individually and then averages over all pixels, we're essentially
asserting equal learning to each pixel in the image. This can be a problem if your
various classes have unbalanced representation in the image, as training can be
dominated by the most prevalent class.
Class weights concepts can be used for weighting this loss for each output
channel in order to counteract a class imbalance present in the dataset.
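
A minimal sketch of a weighted pixel-wise cross entropy in PyTorch (the 12 classes match the road dataset used later in these notes; the weight values are assumptions):

import torch
import torch.nn as nn

num_classes = 12
class_weights = torch.ones(num_classes)
class_weights[0] = 0.2  # e.g. down-weight a dominant background class (assumed).

loss_fn = nn.CrossEntropyLoss(weight=class_weights)
logits = torch.rand(4, num_classes, 224, 224)           # (N, C, H, W) raw scores.
target = torch.randint(0, num_classes, (4, 224, 224))   # (N, H, W) class indices.
loss = loss_fn(logits, target)  # softmax + NLL, averaged over all pixels.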

Another popular loss function for image segmentation tasks is based on the Dice
coefficient (D), which is essentially a measure of overlap (similarity) between
two samples (prediction & GT). This measure ranges from 0 to 1, where a Dice
coefficient of 1 denotes perfect and complete overlap (i.e. perfect prediction).
The Dice coefficient was originally developed for binary data, and can be
calculated as:

D = 2 * |A ∩ B| / (|A| + |B|)

where |A ∩ B| represents the common elements between sets A (predicted mask/set
of pixels) and B (ground truth/target mask/set of pixels), and |A| represents the
number of elements in set A (and likewise for set B). For the case of evaluating
a Dice coefficient on predicted segmentation masks, we can approximate |A ∩ B| as
the element-wise multiplication between the prediction and target mask, and then
sum the resulting matrix.
Dice is an F1 score.

When applied to Boolean data, using the definitions of true positive (TP), false
positive (FP), and false negative (FN), it can be written as:

D = 2*TP / (2*TP + FP + FN)

It is different from the Jaccard index, which only counts true positives once in
both the numerator and denominator.

Ex function for computing Dice:

import numpy as np

def DICE_COE(mask1, mask2):
    intersect = np.sum(mask1*mask2)
    fsum = np.sum(mask1)
    ssum = np.sum(mask2)
    dice = (2 * intersect) / (fsum + ssum)  # apply the above formula for Dice.
    dice = np.mean(dice)
    dice = round(dice, 3)  # for easy reading
    return dice

In order to formulate a loss function which can be minimized, we can
simply use (1 - Dice). This loss function is known as the Soft Dice loss because
we directly use the predicted probabilities instead of thresholding and converting
them into a binary mask.

A soft Dice loss is calculated for each class separately and then averaged
to yield a final score.
Ex Code:

# Soft dice loss calculation for arbitrary batch size, number of classes, and
# number of spatial dimensions. Assumes the `channels_last` format.
def soft_dice_loss(y_true, y_pred, epsilon=1e-6):
    """
    # Arguments
    y_true: b x X x Y( x Z...) x c one-hot encoding of ground truth.
    y_pred: b x X x Y( x Z...) x c network output, must sum to 1 over the c
        channel (such as after softmax).
    epsilon: used for numerical stability to avoid division-by-zero errors.
    """
    # skip the batch and class axes for calculating the Dice score.
    axes = tuple(range(1, len(y_pred.shape)-1))
    numerator = 2 * np.sum(y_pred * y_true, axes)
    denominator = np.sum(np.square(y_pred) + np.square(y_true), axes)
    # average over classes and batch.
    return 1 - np.mean((numerator + epsilon) / (denominator + epsilon))

The Tversky loss is another type of loss used with segmentation, and is based on
the Tversky index for measuring overlap between two segmented images; it is a
generalization of the Dice loss. In set notation, the Tversky index (TIc) between
one image Y and the corresponding ground truth T is given by (where α weights the
false negatives and β the false positives):

TIc = |Y ∩ T| / (|Y ∩ T| + α|Y \ T| + β|T \ Y|)
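
A minimal sketch of a Tversky loss on binary masks (the alpha and beta values are assumptions; alpha = beta = 0.5 reduces it to the Dice loss):

import numpy as np

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-6):
    tp = np.sum(y_true * y_pred)        # overlap.
    fn = np.sum(y_true * (1 - y_pred))  # missed ground-truth pixels.
    fp = np.sum((1 - y_true) * y_pred)  # false-alarm pixels.
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return 1 - ti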

Jaccard index:

The Jaccard index, also known as the Jaccard similarity coefficient, is a
statistic used for gauging the similarity and diversity of sample sets.

The Jaccard coefficient measures similarity between finite sample sets, and is
defined as the size of the intersection divided by the size of the union of the
sample sets:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)

By design, 0 <= J(A, B) <= 1.
NOTE: |A ∪ B| (i.e. |A| + |B| - |A ∩ B|) is not the same as (|A| + |B|).

The Jaccard distance, which measures dissimilarity between sample sets, is
complementary to the Jaccard coefficient and is obtained by subtracting the
Jaccard coefficient from 1, or, equivalently, by dividing the difference of the
sizes of the union and the intersection of two sets by the size of the union:

dJ(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|

In the confusion matrices employed for binary classification, the Jaccard index
can be framed in the following formula:

J = TP / (TP + FP + FN)

Key Differences:

1) The Dice coefficient tends to give more weight to the intersection of the
sets, as it uses the average size of the sets in the denominator. This makes it
more sensitive to smaller overlaps.
2) The Jaccard index, on the other hand, gives equal weight to both the
intersection and union of the sets. It is sometimes preferred when you want a
more balanced measure of similarity.
3) In some applications, the Dice coefficient is considered to be more
stringent, as it may penalize small differences more heavily compared to the
Jaccard index.
4) The Dice coefficient is often used in the context of binary data, such as
pixel-wise segmentation in images, where each pixel is either part of the region
of interest or not. The Jaccard index is more versatile and can be applied to a
wider range of data types and applications.

In summary, both the Dice coefficient and the Jaccard index are measures of set
similarity, but they differ in how they combine the sizes of the intersection and
union of sets, making them suitable for different contexts and applications.

Conversion between Dice & Jaccard:

D = 2J / (1 + J).
J = D / (2 - D).
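
A quick numeric check of the two conversions:

J = 0.5
D = 2 * J / (1 + J)                  # 0.666...
assert abs(J - D / (2 - D)) < 1e-12  # converting back recovers J.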

Accuracy:

Accuracy is a more general metric used for classification tasks. It measures the
ratio of correctly predicted instances (both positive & negative) to the total
number of instances.
Accuracy is useful when all classes in a classification problem are of equal
importance and have a similar distribution in the dataset.
Accuracy provides an overall measure of how well a model is performing
across all classes. It is a useful metric when classes are balanced, but it can be
misleading when dealing with imbalanced datasets because a model can achieve
high accuracy by simply predicting the majority class.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is true positives, TN is true negatives, FP is false positives, and FN
is false negatives.

Accuracy takes into account TN too, whereas Precision, Recall, Dice & Jaccard
only consider TP (i.e. evaluate the quality of positive predictions).

In practice, it's often a good idea to consider multiple evaluation metrics,
especially in cases where both positive and negative predictions are important.
This can give you a more comprehensive view of your model's performance.
Additionally, the choice of metric should align with the specific goals and
requirements of your application.

1) Code for Semantic Segmentation with UNet on Road Dataset:

Dataset:

NOTE: The output/label/mask images contain integers that range between [0, 11].
This indicates that there are 12 different classes.

import os
from torchvision import transforms
from torch.utils.data import Dataset
import torch, cv2
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

class RoadDataset(Dataset):
    def __init__(self, path, isTrain) -> None:  # path = "../dataset"
        super().__init__()
        resizeValue = 224
        self.path = path
        self.preprocess = transforms.Compose([transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
        # holds the data (input images/segmentation masks) names, without extensions.
        self.data = []
        # set folder names to pick data from, depending on training or validation.
        self.images_foldername = ""
        self.seg_foldername = ""
        if isTrain:
            self.images_foldername = "/images_prepped_train/"
            self.seg_foldername = "/annotations_prepped_train/"
        else:
            self.images_foldername = "/images_prepped_test/"
            self.seg_foldername = "/annotations_prepped_test/"
        filepath = path + self.images_foldername
        # get list of all images in the train or validation folder.
        self.data = [os.path.splitext(filename)[0] for filename in os.listdir(filepath)]
        #print(len(self.data))

    def __len__(self):
        return len(self.data)

    # return the input image & its mask (image) as 2 tensors.
    def __getitem__(self, index):
        filename = self.data[index]
        image = cv2.imread(self.path + self.images_foldername + filename + ".png")
        # convert channel ordering & resize input image.
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = cv2.resize(image, (224, 224)).astype(np.float32)
        image = (image/255)  # convert to scale 0-1.
        image = self.preprocess(image)  # preprocess input image.
        # load mask as a single-channel image; cv2 loads it in [HxW] ordering by default.
        seg_mask = cv2.imread(self.path + self.seg_foldername + filename + ".png",
                              cv2.IMREAD_GRAYSCALE)
        seg_mask = cv2.resize(seg_mask, (224, 224))
        # convert mask to tensor, type long().
        seg_mask = torch.from_numpy(seg_mask).long()
        return image.to(device), seg_mask.to(device)

Model:

import torch
import torch.nn as nn
from torchvision.models import vgg16_bn, VGG16_BN_Weights

"""
VGG16-BN summary:

VGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(5): ReLU(inplace=True) # block1 (2 Convs).

(6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

280
(7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(9): ReLU(inplace=True)
(10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(12): ReLU(inplace=True) # block2 (2 Convs, 1
maxpool).

(13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)


(14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(16): ReLU(inplace=True)
(17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(19): ReLU(inplace=True) # block3 (2 Convs, 1
maxpool).

(20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))


(21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(22): ReLU(inplace=True)
(23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(24): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(26): ReLU(inplace=True) # block4 (2 Convs, 1
maxpool).

(27): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))


(28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(29): ReLU(inplace=True)
(30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(32): ReLU(inplace=True)
(33): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
# block5 (2 Convs, 1 maxpool).

281
(34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(36): ReLU(inplace=True)
(37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(39): ReLU(inplace=True)
(40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True,
track_running_stats=True)
(42): ReLU(inplace=True)
(43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
# bottleneck.
)
(avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
(classifier): Sequential(
(0): Linear(in_features=25088, out_features=4096, bias=True)
(1): ReLU(inplace=True)
(2): Dropout(p=0.5, inplace=False)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(4): ReLU(inplace=True)
(5): Dropout(p=0.5, inplace=False)
(6): Linear(in_features=4096, out_features=1000, bias=True)
)
)

"""

class RoadSegmentationModel(nn.Module):
    def __init__(self, out_channels) -> None:
        super().__init__()
        self.model_name = "trained_models/RoadSegmentationModel.pth"
        # use 'features' block from the vgg16_bn pretrained model as the convolutions
        # providing feature maps.
        self.encoder = vgg16_bn(weights=VGG16_BN_Weights.DEFAULT).features
        self.down_block1 = nn.Sequential(*self.encoder[:6])     # from 0-5.
        self.down_block2 = nn.Sequential(*self.encoder[6:13])   # from 6-12.
        self.down_block3 = nn.Sequential(*self.encoder[13:20])
        self.down_block4 = nn.Sequential(*self.encoder[20:27])
        self.down_block5 = nn.Sequential(*self.encoder[27:34])

        # freeze weights.
        #for param in self.parameters():
        #    param.requires_grad = False

        self.bottleneck = nn.Sequential(*self.encoder[34:])
        self.conv_bottleneck = self.conv(512, 1024)  # encoder output channels is 512 here,
                                                     # see VGG16_BN summary.

        # define the up conv units.
        self.up_conv6 = self.up_conv(1024, 512)  # stride=2, kernel size=2.
                                                 # Hout = (Hin-1)*2 + 2, Wout = (Win-1)*2 + 2.
        self.conv6 = self.conv(512 + 512, 512)
        self.up_conv7 = self.up_conv(512, 256)
        self.conv7 = self.conv(256 + 512, 256)
        self.up_conv8 = self.up_conv(256, 128)
        self.conv8 = self.conv(128 + 256, 128)
        self.up_conv9 = self.up_conv(128, 64)
        self.conv9 = self.conv(64 + 128, 64)
        self.up_conv10 = self.up_conv(64, 32)
        self.conv10 = self.conv(32 + 64, 32)
        # final output convolution has "out_channels" channels, equal to the number of
        # classes to segment.
        self.conv11 = nn.Conv2d(32, out_channels, kernel_size=1)

    # a unit that performs convolution.
    def conv(self, in_channels, out_channels):
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    # a unit that performs upscaling of the image, via transposed convolution. output image
    # size depends on kernel size & stride values, among other (default value - not set here)
    # variables.
    # ConvTranspose2D output image dimensions formula:
    # input width, height: Win, Hin. output width, height: Wout, Hout.
    # Hout = (Hin-1)*stride[0] - 2*padding[0] + dilation[0]*(kernel_size[0]-1) + output_padding[0] + 1
    # Wout = (Win-1)*stride[1] - 2*padding[1] + dilation[1]*(kernel_size[1]-1) + output_padding[1] + 1
    # with stride=2, kernel size=2, padding=0, dilation=1, output_padding=0:
    # Hout = (Hin-1)*2 - 2*0 + 1*(2-1) + 0 + 1
    #      = (Hin-1)*2 + 1 + 1
    #      = (Hin-1)*2 + 2
    # similarly for Wout.
    def up_conv(self, in_channels, out_channels):
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2),
            nn.ReLU(inplace=True)
        )

    # a block that performs upscaling of the input image.
    def up_block(self, input, cat_input, up_conv_unit, conv_unit):
        input = up_conv_unit(input)  # perform transposed convolution.
        # inputs are CONCATENATED here (not added, as in Residual connections), along
        # the channel dimension (dim=1) (shape = [N,C,H,W]).
        input = torch.cat([input, cat_input], dim=1)
        # UNet architecture has as many convs() in up_block() as in its corresponding down_block().
        input = conv_unit(input)
        return input

    def forward(self, input):
        # move downwards, performing convolutions.
        block1 = self.down_block1(input)   # down block 1.
        block2 = self.down_block2(block1)  # down block 2.
        block3 = self.down_block3(block2)  # down block 3.
        block4 = self.down_block4(block3)  # down block 4.
        block5 = self.down_block5(block4)  # down block 5.

        bottleneck = self.bottleneck(block5)
        input = self.conv_bottleneck(bottleneck)

        # move upwards, performing upscaling via transposed convolutions. Also, previous
        # outputs (blockX's) from down-blocks are concatenated to corresponding up-blocks.
        input = self.up_block(input, block5, self.up_conv6, self.conv6)    # up block 1.
        input = self.up_block(input, block4, self.up_conv7, self.conv7)    # up block 2.
        input = self.up_block(input, block3, self.up_conv8, self.conv8)    # up block 3.
        input = self.up_block(input, block2, self.up_conv9, self.conv9)    # up block 4.
        input = self.up_block(input, block1, self.up_conv10, self.conv10)  # up block 5.

        input = self.conv11(input)
        return input

Train/Validate:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim

device = "cuda" if torch.cuda.is_available() else "cpu"

def train(epochs, model:nn.Module, dLoader:DataLoader, lossFn, optimFn:optim.Optimizer):
    model.train()
    for epoch in range(epochs):
        for _, [input_images, label_masks] in enumerate(dLoader):
            optimFn.zero_grad()
            pred_masks = model(input_images)  # returned prediction is mask image/s.
            loss, accuracy = lossFn(pred_masks, label_masks)
            loss.backward()
            optimFn.step()
        print(f"Epoch={epoch+1} => Loss={loss.item():0.3f}, Accuracy={accuracy.item():0.3f}.")
    print("\n")

def validate(model:nn.Module, dLoader:DataLoader, lossFn, optimFn:optim.Optimizer):
    model.eval()
    with torch.no_grad():
        for _, [input_images, label_masks] in enumerate(dLoader):
            optimFn.zero_grad()  # not strictly needed under no_grad(); kept for symmetry.
            pred_masks = model(input_images)
            # compute loss and accuracy.
            loss, accuracy = lossFn(pred_masks, label_masks)
            print(f"Validation per Batch => Loss={loss.item():0.3f}, Accuracy={accuracy.item():0.3f}")
    print("\n")

main:

import torch
import torch.nn as nn
from RoadDataset import RoadDataset
from torch.utils.data import DataLoader
from RoadSegmentationModel import RoadSegmentationModel
from train_validate import train, validate
import torch.optim as optim
import numpy as np
import cv2
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# define loss function for UNet based model.
ce_lossFn = nn.CrossEntropyLoss()

def UNetLoss(pred_prob_masks, label_masks):
    # prediction channels=12, label channels=1 (grayscale). cross entropy can handle the
    # 1-channel target (values 0-11) against the 12-channel predictions directly. predictions
    # contain 12 channels, with each pixel of a channel representing the probability (logit)
    # of that pixel belonging to that channel (i.e. class).
    ce_loss = ce_lossFn(pred_prob_masks, label_masks)
    # get max values along dim=1 (channels). "prediction" shape=(NxCxHxW). torch.max() reduces
    # the C channels to 1 channel containing the max values. torch.max() returns these max
    # values[0] & their indices[1] - the indices are the channel numbers i.e. 0-11.
    _, preds_indices_masks = torch.max(pred_prob_masks, 1)
    # compute accuracy, based on similarity between prediction & label pixel values.
    accuracy = (preds_indices_masks == label_masks).float().mean()
    return ce_loss, accuracy

classes = 12  # output classes.

def run_training(epochs=20):
    dataset = RoadDataset("../dataset", True)
    # drop_last=True drops the last non-full batch of the dataset.
    dLoader = DataLoader(dataset, batch_size=10, shuffle=True, drop_last=True)
    model = RoadSegmentationModel(classes).to(device)
    optimFn = optim.Adam(model.parameters(), lr=0.0005)
    lossFn = UNetLoss  # use the custom loss function.
    try:
        print("\n")
        print("------------Training Started-----------------")
        train(epochs, model, dLoader, lossFn, optimFn)
    except KeyboardInterrupt:  # if "Ctrl+C" interrupt occurs, save model.
        torch.save(model.state_dict(), model.model_name)
        print(f"Model saved to {model.model_name}")
        print("")
    else:
        # if no Ctrl+C was pressed, declare training complete.
        print("------------Training Complete-----------------")
        torch.save(model.state_dict(), model.model_name)
        print(f"Model saved to {model.model_name}")
        print("\n")

def run_validation():
    dataset = RoadDataset("../dataset", False)
    dLoader = DataLoader(dataset, batch_size=10, shuffle=False, drop_last=True)
    model = RoadSegmentationModel(classes).to(device)
    optimFn = optim.Adam(model.parameters(), lr=0.0005)
    lossFn = UNetLoss
    result = model.load_state_dict(torch.load(model.model_name))
    print("\n")
    print("------------Validation Started-----------------")
    validate(model, dLoader, lossFn, optimFn)
    print("------------Validation Complete-----------------")
    print("\n")

# to resume training.
def resume_training(epochs=20):
    dataset = RoadDataset("../dataset", True)
    # drop_last=True drops the last non-full batch of the dataset.
    dLoader = DataLoader(dataset, batch_size=10, shuffle=True, drop_last=True)
    model = RoadSegmentationModel(classes).to(device)
    # load previously trained model.
    result = model.load_state_dict(torch.load(model.model_name))
    optimFn = optim.Adam(model.parameters(), lr=0.0005)
    lossFn = UNetLoss
    try:
        print("\n")
        print("------------Training Resumed-----------------")
        train(epochs, model, dLoader, lossFn, optimFn)
    except KeyboardInterrupt:  # if "Ctrl+C" interrupt occurs, save model.
        torch.save(model.state_dict(), model.model_name)
        print(f"Model saved to {model.model_name}")
    else:
        print("------------Training Complete-----------------")
        torch.save(model.state_dict(), model.model_name)
        print(f"Model saved to {model.model_name}")
        print("\n")

# converts tensor to opencv image format (numpy array) by rearranging dimensions
# before conversion.
def convert_tensor_image_to_opencv(tensor_img):
    # 'tensor_img' format = CxHxW (if using transforms.ToTensor()).
    tensor_img = torch.permute(tensor_img, dims=(1, 2, 0))
    tensor_img *= 255
    return tensor_img.numpy().astype(np.uint8).copy()  # format = HxWxC, as required by cv2 images.

# preprocess input image before prediction.
def preprocess_image_for_prediction(image):
    # use COLOR_BGR2RGB here, for use as input in model.
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
    image = preprocess(image)
    return image

# performs prediction on a single image. Displays both label & predicted mask for
# performance comparison.
def perform_prediction(path, img_path, label_path, input_filename):
    # Create a new folder with 1 test image & its label mask, then use RoadDataset to
    # point to this folder.
    # read input image.
    input_image = cv2.imread(path + img_path + input_filename)
    input_image = cv2.resize(input_image, (224, 224)).astype(np.float32)
    input_image = (input_image / 255)  # if we don't do this, the entire image is displayed
    # as white in cv2.imshow(). No need to use COLOR_BGR2RGB here, as cv2.imshow() expects
    # BGR ordering anyway.
    # create & load trained model.
    model = RoadSegmentationModel(classes).to(device)
    result = model.load_state_dict(torch.load(model.model_name))
    model.eval()  # set eval mode (affects batchnorm), since we are doing inference.
    # preprocess input image as a separate tensor, to be fed into the model.
    preprocessed_input_image = preprocess_image_for_prediction(input_image)
    # run inference. add batch dimension to input & move it to the model's device.
    with torch.no_grad():
        predicted_mask = model(preprocessed_input_image.unsqueeze(0).to(device))
    _, predicted_mask = torch.max(predicted_mask, dim=1)  # merge 12 channel image to
                                                          # 1 channel image.
    cv2.imshow("Input Image", input_image)
    cv2.moveWindow("Input Image", 10, 20)  # set window position.
    # load input image's label mask & preprocess it, if it exists.
    value_multiplier = int(255 / 11)  # apply this on segment mask, for visualization purpose.
    # LABEL image is used only for display purpose.
    label_mask = cv2.imread(path + label_path + input_filename, cv2.IMREAD_GRAYSCALE)
    if label_mask is not None:
        label_mask = cv2.resize(label_mask, (224, 224), interpolation=cv2.INTER_NEAREST)
        label_mask = torch.from_numpy(label_mask).long()
        label_mask *= value_multiplier  # modify mask values from range 0-11 to 0-255 (255/11=23).
        cv2.imshow("Label Mask", label_mask.numpy().astype(np.uint8).copy())
        cv2.moveWindow("Label Mask", 250, 20)
    # modify prediction mask values from range 0-11 to 0-255.
    predicted_mask *= value_multiplier
    # convert predicted_mask from shape=[1, 224, 224] to shape=[224, 224].
    predicted_mask = torch.squeeze(predicted_mask)
    cv2.imshow("Predicted Mask", predicted_mask.cpu().numpy().astype(np.uint8).copy())
    cv2.moveWindow("Predicted Mask", 500, 20)
    cv2.waitKey()
    print("Inference complete.\n")

#run_training(20)   # Epoch=20 => Loss=0.264, Accuracy=0.929.

#resume_training(10)

#run_validation()   # Validation per Batch => Loss=0.360, Accuracy=0.892

perform_prediction("../dataset/", "images_prepped_test/", "annotations_prepped_test/", "0016E5_07959.png")
#perform_prediction("../dataset/", "images_prepped_test/", "annotations_prepped_test/", "0016E5_08157.png")
#perform_prediction("../dataset/", "images_prepped_test/", "annotations_prepped_test/", "0016E5_08135.png")

#perform_prediction("../dataset/", "images_prepped_train/", "annotations_prepped_train/", "0006R0_f02100.png")  # from training set.

INSTANCE SEGMENTATION:

Mask R-CNN:

The Mask R-CNN architecture helps in identifying/highlighting the instances of
objects of a given class within an image, & is an extension of the Faster R-CNN,
with the following modifications:
- The RoI Pooling layer has been replaced with the RoI Align layer.
- A mask head has been included to predict a mask of objects, in addition to
the head which already predicts the classes of objects and bounding box
correction in the final layer.
- A fully convolutional network (FCN) is leveraged for mask prediction.
An FCN is a network (used mainly for semantic segmentation) that does not
contain any "Dense" layers (as in traditional CNNs with fully connected
(flattened) layers); instead it contains 1x1 convolutions that perform the task of
fully connected (Dense) layers, as illustrated in the sketch below. FCNs employ
solely locally connected layers, such as convolution, pooling and upsampling.
No dense layers means fewer parameters to deal with.
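
As a small illustration of this idea (not part of Mask R-CNN itself; the shapes are
made up), a 1x1 convolution behaves like a Dense layer applied at every spatial
location:

import torch
import torch.nn as nn

feat = torch.randn(1, 512, 7, 7)          # a hypothetical feature map: N=1, C=512, 7x7 spatial.

# dense head: flattening destroys the spatial layout.
dense_head = nn.Linear(512 * 7 * 7, 10)
out_dense = dense_head(feat.flatten(1))   # shape [1, 10] - one prediction per image.

# fully convolutional head: a 1x1 conv acts like a per-pixel dense layer.
conv_head = nn.Conv2d(512, 10, kernel_size=1)
out_conv = conv_head(feat)                # shape [1, 10, 7, 7] - one prediction per location.
print(out_dense.shape, out_conv.shape)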

Working:

(Figure: Mask R-CNN architecture - diagram not reproduced here.)

In the preceding diagram, note that we are fetching the class and bounding box
information from one layer and the mask information from another layer.

RoI Align:

One of the drawbacks of RoI Pooling in Faster R-CNN is that we are likely to lose
certain information when performing the RoI pooling operation. Before pooling, the
content is evenly spread across all the areas of the region; but while converting the
input to a fixed output size (quantization), one part of the region gets less
representation compared to other parts of the region (when picking up values),
resulting in information loss.
The following diagrams show how RoI Align works:
Ex: trying to convert the following region (which is represented in dashed
lines) into a 2 x 2 shape:

Note that the region (in dashed lines) is not equally spread across all the cells in
the feature map.
We must perform the following steps to get a reasonable representation of the
region in a 2 x 2 shape:
(1) First, divide the region into an equal 2 x 2 shape:

(2) Define four points that are equally spaced within each of the 2 x 2 cells:

The distance between two consecutive points is 0.75.

(3) Calculate the weighted average value of each point based on its distance
to the nearest known values (bilinear interpolation):

(4) Repeat the preceding interpolation step for all four points in a cell:

(5) Perform average pooling across all four points within a cell:
Ex: (0.21778 + 0.27553 + 0.14896 + 0.21852)/4 = 0.86079/4 = 0.21520

By implementing the preceding steps, we don't lose out on information when
performing RoI Align; that is, when we place all the regions inside the same
shape.
Using RoI Align, we can get a more accurate representation of the region
proposal that is obtained from the Region Proposal Network.
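
In PyTorch, this operation is available as torchvision.ops.roi_align. A minimal sketch
(the feature map and box values below are made up for illustration; with the default
spatial_scale=1.0, boxes are given in feature-map coordinates):

import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)  # hypothetical backbone output: N=1, C=256, 50x50.
# one region proposal in (batch_index, x1, y1, x2, y2) format, with fractional coordinates.
boxes = torch.tensor([[0, 10.3, 12.7, 30.1, 35.6]])
# pool the region into a fixed 7x7 output; sampling_ratio=4 points per output cell are
# bilinearly interpolated (the same idea as the 4-point example above).
region = roi_align(feature_map, boxes, output_size=(7, 7), sampling_ratio=4, aligned=True)
print(region.shape)   # torch.Size([1, 256, 7, 7])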

Mask Head:

Typically, in the case of object detection, we would pass the RoI Align through a
flattened layer in order to predict the object's class and bounding box offset.
However, in the case of image segmentation, we predict the pixels within a
bounding box that contains the object. Hence, we now have a third output (apart
from class and bounding box offset), which is the predicted mask within the
region of interest.
Here, we are predicting the mask, which is an image overlay on top of the
original image. Given that we are predicting an image, instead of flattening the
RoI Align's output, we'll connect it to another convolution layer/s to get another
image-like structure.
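
For reference, torchvision ships a pretrained Mask R-CNN whose output carries exactly
these three pieces per detected instance (plus confidence scores). A minimal inference
sketch, assuming a recent torchvision (the 0.13+ weights API) and a dummy input tensor:

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights)
model.eval()

image = torch.rand(3, 480, 640)   # a dummy image in [0,1]; replace with a real one.
with torch.no_grad():
    output = model([image])[0]    # the model takes a list of images.

# per detected instance: class label, corrected bounding box, and a soft mask.
print(output["labels"].shape)     # [N]
print(output["boxes"].shape)      # [N, 4]
print(output["masks"].shape)      # [N, 1, 480, 640] - per-pixel mask probabilities.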

The label masks for instance segmentation are in the form of binary masks of shape
Height x Width x NumberOfInstances:
Height = height of the mask image (& input data image)
Width = width of the mask image (& input data image)
NumberOfInstances = number of instances in the input image.
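
For example (a toy, made-up case), two instances in a 4 x 4 image could be stored as a
(4, 4, 2) boolean array:

import numpy as np

H, W = 4, 4
instance1 = np.zeros((H, W), dtype=bool); instance1[0:2, 0:2] = True  # first object.
instance2 = np.zeros((H, W), dtype=bool); instance2[2:4, 2:4] = True  # second object.
label_masks = np.stack([instance1, instance2], axis=-1)  # shape (H, W, NumberOfInstances).
print(label_masks.shape)   # (4, 4, 2)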

2) Code for Instance Segmentation with Yolov8 on Underwater Dataset:

mask_to_polygon.py (converts masks from image form into x,y points, as required by
YOLO):
import os
import cv2
import numpy as np
import shutil

# indentation: 4 spaces.

classes_names = ["background", "human divers", "plants", "ruins", "robots",
                 "reefs & invertebrates", "fish & vertebrates", "sea floor & rocks"]  # 8 classes.
classes_codes = [[0,0,0], [0,0,255], [0,255,0], [0,255,255], [255,0,0], [255,0,255],
                 [255,255,0], [255,255,255]]  # contains the RGB codes for class masks.

# The "convert_label_mask_to_yolo_format()" helper function is used to convert masks in
# image format into points/polygons format in a text file, along with their class
# indices, as required by YOLO for training.
# This is a 1-time step, used only when the dataset is to be used the first time.

# separates individual class masks from the RGB mask image. also returns the
# individual class indices.
def separate_underwater_mask(rgb_mask, H, W):
    classes = []; masks = []
    gray_mask = cv2.cvtColor(rgb_mask, cv2.COLOR_RGB2GRAY)  # convert rgb mask to grayscale.

    for index in range(len(classes_codes)):
        #if index == 0:  # ignore background.
        #    continue

        # convert class rgb code to grayscale code.
        code = np.array(classes_codes[index], dtype=rgb_mask.dtype)  # get class code for index.
        mask = code.reshape(1, 1, 3)  # convert code into an image-type structure for
                                      # conversion to grayscale.
        gray_code = cv2.cvtColor(mask, cv2.COLOR_RGB2GRAY)  # convert rgb code to grayscale.

        # get the class mask for the current class, if any.
        class_mask = np.zeros((H, W), dtype=rgb_mask.dtype)  # get a blank mask.
        diff_mask_indices = (gray_mask == gray_code[0][0])   # get indices of pixels matching
                                                             # our class mask code.
        class_mask[diff_mask_indices] = 255  # set all such matched pixels to the highest value.

        # debugging only.
        #cv2.imshow("class mask", class_mask)
        #cv2.waitKey()

        # append only if a mask was detected for the current class mask code.
        if class_mask.max() != 0:
            classes.append(index)
            masks.append(class_mask)

    return classes, masks

# given a mask image, returns the equivalent polygon points.
def convert_mask_to_polygons(mask, H, W):
    # get only external contours per object.
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # convert the contours to polygons.
    # For each i-th contour i.e. contours[i], hierarchy[i][0] points to the next contour
    # at the same hierarchical level.
    polygons = []  # contains points for all contours in mask.
    for cnt in contours:
        if cv2.contourArea(cnt) > 200:  # check if the area is more than some minimum
                                        # acceptable number.
            polygon = []  # contains all points for a given contour.
            for point in cnt:
                x, y = point[0]
                # (x,y) points normalized relative to the entire image.
                norm_x = x / W
                norm_y = y / H
                polygon.append((norm_x, norm_y))
            polygons.append(polygon)
    return polygons

# writes points info into file 'f'.
def write_points_in_file(polygons:list, class_index, f):
    # for each new contour in the list of contours, add a new entry of its class
    # index with points info.
    for polygon in polygons:
        f.write(f"{class_index} ")
        for p in polygon:  # write a single contour's points info.
            #f.write('{} '.format(p))
            f.write(f'{p[0]} {p[1]} ')
        f.write('\n')  # put "\n" at end of each contour.
    return

# convert raw mask (label) images to a file containing { <class> x1 y1 x2 y2 ... }
# segmentation points format, as required by YOLO. This conversion is for the
# "Underwater" dataset.
def convert_label_mask_to_yolo_format(input_dir, output_dir):
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)  # delete any existing (empty/non-empty) directory.
    os.mkdir(output_dir)  # create the output folder.

    for j in os.listdir(input_dir):
        image_path = os.path.join(input_dir, j)

        if os.path.isdir(image_path):
            continue

        # read input image.
        mask = cv2.imread(image_path)
        #print(image_path)
        if mask is None:
            continue

        mask = cv2.cvtColor(mask, cv2.COLOR_BGR2RGB)

        H, W, _ = mask.shape
        # (1) separate all masks in the image into separate masks & their corresponding classes.
        classes, masks = separate_underwater_mask(mask, H, W)

        # (2) print the points for all class masks in the mask image, into a file.
        with open('{}.txt'.format(os.path.join(output_dir, j)[:-4]), 'w') as f:
            # for each class mask in a given label, generate & store its points data.
            for index, mask in enumerate(masks):
                # each mask image contains only one object (mask).
                polygons = convert_mask_to_polygons(mask, H, W)
                # add the class number here, in case of multiclass segmentation.
                write_points_in_file(polygons, classes[index], f)  # write points info
                                                                   # for a single class mask.
    print("Conversion complete")

# TRAINING DATASET.
#in_dir = "../dataset/underwater/train_val/masks"    # path to input masks (images).
#out_dir = "../dataset/underwater/train_val/labels"  # path where output mask points
                                                     # (text files) should go.

# TEST DATASET.
in_dir = "../dataset/underwater/test/masks"
out_dir = "../dataset/underwater/test/labels"

convert_label_mask_to_yolo_format(in_dir, out_dir)

data.yaml:

# Ultralytics seems to store the previous path somewhere, hence the need to provide absolute paths.
train: /Users/abhijitpoojary/Desktop/PyTorchVSCode/Segmentation/YOLO8_Segmentation/dataset/underwater/train_val/images

val: /Users/abhijitpoojary/Desktop/PyTorchVSCode/Segmentation/YOLO8_Segmentation/dataset/underwater/test/images

#test: ../../dataset/yolo_data/test  # optional.

# number of classes
nc: 8

# class names
names: ["background", "human divers", "plants", "ruins", "robots", "reefs & invertebrates", "fish & vertebrates", "sea floor & rocks"]

Training / Inference:

from ultralytics import YOLO
import cv2
import numpy as np
import torch

# YOLO performs instance segmentation. For pretrained YOLO segmentation model weights,
# see link: https://docs.ultralytics.com/models/yolov8/#supported-tasks

# Before training on the underwater dataset, the (pretrained only) model detects only
# person, not robot or reef.
# After training/fine-tuning, the model segments all classes with good accuracy.

# train_model() is used to train a YOLOv8 model with raw or pretrained weights
# (transfer learning), on the Underwater dataset.
def train_model(epochs=10):
    model = YOLO("yolov8n-seg.pt")  # load pretrained weights. will download the first time.
    print("\nTraining started\n")
    model.train(data="data.yaml", epochs=epochs)
    #success = model.export(format="onnx")  # supports other formats too (TF, torchscript, etc).
    print("\nTraining complete.\n")

# resume_training() is used to further train an already trained model, whose weights
# file path is mentioned in "model_path".
def resume_training(model_path, epochs=10):
    model = YOLO(model_path)  # load previously trained weights.
    print("\nTraining resumed\n")
    model.train(data="data.yaml", epochs=epochs)
    print("\nTraining complete.\n")

# perform validation.
def validate_model():
    # link: https://docs.ultralytics.com/modes/val/
    model = YOLO("runs/segment/train/weights/best.pt")  # no arguments needed; dataset and
    # settings are remembered. It seems all the necessary data is stored in an ultralytics
    # database somewhere.
    metrics = model.val(data="data.yaml")  #model.val() #results = model.val(data="data.yaml")
    #metrics.box.map    # map50-95
    #metrics.box.map50  # map50
    #metrics.box.map75  # map75
    print(metrics.box.maps)  # a list containing map50-95 of each category.
    print("\nValidation complete.\n")

# predict() is used to perform inference using an already trained model.
def predict(img_name):
    #model = YOLO("yolov8n-seg.pt")  # predict using pretrained model.
    model = YOLO("runs/segment/train/weights/best.pt")  # predict using own fine-tuned model.
    # pick up from validation folder data.
    img_path = "../dataset/underwater/test/images/"
    # "stream=True" returns a generator (memory efficient). Input to predict() can be an
    # image path, url, opencv|PIL image, tensor, directory, video, etc. See link:
    # https://docs.ultralytics.com/modes/predict/#image-formats for more info on input &
    # results structure.
    results = model.predict(img_path + img_name)
    res_plotted = results[0].plot()

    # show plotted (annotated) image in window.
    #cv2.imshow("window1", res_plotted)
    #cv2.waitKey(0)
    #print(f"results length: {len(results)}")

    # get individual outputs from result. result.orig_img = original image loaded in
    # memory, result.orig_shape = original image shape. results can have multiple entries
    # if inputs are multiple i.e. say a video or a directory.
    for result in results:
        all_boxes = result.boxes.boxes  # this is for obtaining the class indices (at array
        # index 5 i.e. 6th element - all_boxes shape is (N,6), where N is the number of detections).
        predicted_class_indices = all_boxes[:, 5].to(torch.uint8)
        masks = result.masks  # masks.segments (bounding coordinates of masks), masks.data
        # (raw masks tensor). "masks" contains all the masks (classes) in the input test image.
        img = cv2.imread(img_path + img_name)  # read original image.
        #img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # not needed; cv2 displays BGR images directly.
        cv2.imshow("original image", result.orig_img)
        cv2.moveWindow("original image", 10, 10)
        if not masks or masks.shape[0] == 0:
            print("\nNo Detections")
            cv2.waitKey(0)
        else:
            superimpose_segments(img, masks, predicted_class_indices, result.names)
    print("\nPrediction completed.\n")

def superimpose_segments(img, masks, predicted_class_indices, class_names):
    for index, mask in enumerate(masks):
        # superimpose a semi-transparent mask with a random color on the original image.
        color = np.array((np.random.randint(0, 255), np.random.randint(0, 255),
                          np.random.randint(0, 255)), dtype=img.dtype)  # a random mask color.
        # resize mask to img dimensions. Note that the mask's aspect ratio is in accordance
        # with the original image's aspect ratio.
        mask = mask.data.numpy().astype(img.dtype)  # get actual data from mask structure.
        mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2RGB)
        resized_mask = cv2.resize(mask, (img.shape[1], img.shape[0]))
        resized_mask *= color  # add color value to mask image.
        #cv2.imshow("mask", resized_mask)
        overlayed_img = cv2.addWeighted(img, 1, resized_mask, 0.5, 0)  # OVERLAY mask image
                                                                       # on original image with alpha.
        predicted_class_name = class_names[predicted_class_indices[index].item()]  # use
        # predicted class name as window name.
        cv2.imshow(predicted_class_name, overlayed_img)
        cv2.moveWindow(predicted_class_name, 600, 10)
        cv2.waitKey(0)

#train_model()

#resume_training("runs/segment/train/weights/best.pt", 10)

#validate_model()

predict("d_r_47_.jpg")
#predict("d_r_58_.jpg")
#predict("d_r_84_.jpg")
#predict("d_r_122_.jpg")
#predict("d_r_129_.jpg")

(D) OBJECT TRACKING:

Object Tracking (& motion estimation) is the process of locating & tracking a
moving object (or multiple objects) over time (video) using a camera. It has a
variety of uses, some of which are: human-computer interaction, security and
surveillance, video communication and compression, augmented reality, traffic
control, medical imaging and video editing.
The objective of video tracking is to associate target objects in
consecutive frames.
Object tracking is an application where the program takes an initial set of
object detections, develops a unique identification for each of them, and then
tracks the detected objects (maintaining their identification) as they move
across frames in a video. In other words, object tracking is the task of
automatically identifying objects in a video and interpreting them as a set of
trajectories with high accuracy (including when objects are occluded, or
disappear for a few frames & come back again).

Some popular techniques for Object tracking are:

- SORT (Simple Online and Realtime Tracking) & DeepSORT (uses appearance
matching of bounding boxes too).
- Multiple Object Tracking (MOT) & FairMOT.
- ByteTrack (handles occlusion & hence unwanted ID switching too - can be
used in conjunction with other trackers).
- Kalman Filter (predicting the next position), Hungarian/Kuhn–Munkres
algorithm (optimization algo for data association i.e. matching objects in previous
& current frames - see the sketch after this list).
- Optical Flow
(https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html).
- Track Any Point (TAP) techniques:
(a) TAPIR (from Deepmind - Tracking Any Point with per-frame Initialization and
temporal Refinement) - a robust technique for tracking any given point/pixel in a
video.
(b) CoTracker (from Meta AI - tracking multiple points (say belonging to the
same object) together, so that tracking accuracy is better).
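
As an illustration of the data-association step (a minimal sketch, assuming scipy is
available; the boxes below are made up), the Hungarian algorithm can match previous-frame
tracks to current-frame detections by maximizing IoU:

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # boxes in (x1, y1, x2, y2) format.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

tracks = [(10, 10, 50, 50), (100, 100, 160, 150)]      # boxes from the previous frame.
detections = [(105, 104, 162, 152), (12, 9, 51, 49)]   # boxes from the current frame.

# cost matrix: negative IoU, since linear_sum_assignment minimizes total cost.
cost = np.array([[-iou(t, d) for d in detections] for t in tracks])
track_idx, det_idx = linear_sum_assignment(cost)
for t, d in zip(track_idx, det_idx):
    if -cost[t, d] > 0.3:  # accept the match only above an IoU threshold.
        print(f"track {t} -> detection {d} (IoU={-cost[t, d]:.2f})")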
