DL 1 - ComputerVision With PyTorch Notes
INTRODUCTION:
The core PyTorch modules for building neural networks are located in torch.nn,
which provides common neural network layers and other architectural
components. Fully connected layers, convolutional layers, activation functions,
and loss functions can all be found here.
INSTALLATION:
PYTORCH on CPU:
(1) Install Anaconda and create a new environment in it (if not using conda - as it
itself takes some space - create a plain Python virtual environment with “python3
-m venv <env name>”, then activate it with “source <env name>/bin/activate” (use
“deactivate” to close the virtual env). Might need to install pip too: “python3 -m pip
install --user --upgrade pip”. VSCode can then be started from this terminal as
“code .” - provided the “Shell Command: Install 'code' command in PATH” action has been run once in VSCode).
(2) Open terminal in conda by clicking on the “Play” button next to the
environment name. Install Pytorch using command:
conda install pytorch torchvision -c pytorch
(link: https://pytorch.org/get-started/locally/)
(3) Install/launch Jupyter Notebook in Anaconda, then type (press Shift+Enter to
execute each command):
import torch
torch.__version__ # check torch version
torch.cuda.is_available() # check if GPU processing is possible.
- Nvidia commands:
(1) nvidia-smi: This utility/command (run in a command prompt/terminal) allows
administrators to query GPU device state and, with the appropriate privileges,
permits administrators to modify GPU device state. It also gives info about the GPU
hardware, its drivers, CUDA version, etc. (try the -h/--help flag for more info).
Use nvidia-smi -l to get/track an (almost) realtime update on memory usage.
(2) nvtop: For GPU process monitoring, an “htop”-like task monitor for
AMD, Intel and NVIDIA GPUs. It can handle multiple GPUs and prints information
about them in an htop-familiar way.
TENSORBOARD:
# We will be creating an instance of “SummaryWriter” and then adding our model’s evaluation
features like loss, the number of correct predictions, accuracy, etc. to it. One of the novel
features of TensorBoard is that we simply have to feed our output tensors to it and it displays
the plot of all those metrics; in this way TensorBoard takes care of all the plotting for us.
tb = SummaryWriter()
# tb.add_graph(model, images) # displays model architecture graph.
# Sketch of a typical loop (model, criterion, train_loader, train_set, num_epochs and the
# hyperparameter values lr, batch_size & shuffle are assumed to be defined):
for epoch in range(num_epochs):
    total_loss = 0
    total_correct = 0
    for images, labels in train_loader:
        preds = model(images)
        loss = criterion(preds, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step() # get LR from optim: optimizer.param_groups[0]['lr'].
        total_loss += loss.item()
        total_correct += preds.argmax(dim=1).eq(labels).sum().item()
    tb.add_scalar("Loss", total_loss, epoch)
    tb.add_scalar("Number Correct", total_correct, epoch)
    tb.add_scalar("Accuracy", total_correct / len(train_set), epoch)
tb.add_hparams(
    {"lr": lr, "batch_size": batch_size, "shuffle": shuffle},
    {
        "accuracy": total_correct / len(train_set),
        "loss": total_loss,
    },
)
tb.close()
Note that every tb.add_scalar() call takes three arguments: a string which will be
the heading of the line chart/histogram, then the tensor/value containing the data to
be plotted, and finally a global step. Since we are doing an epoch-wise analysis,
we have set it to epoch.
After running the code a (by default) “runs” folder will be created in the
project directory (can be changed while creating SummaryWriter()). All runs
going ahead will be sorted in the folder by date. This way you have an efficient
log of all runs which can be viewed and compared in TensorBoard.
(3) Now use the command line (or Anaconda Prompt) to navigate into your
project directory where the “runs” folder is present and run the following
command:
tensorboard --logdir runs
It will then serve TensorBoard on the localhost, the link for which will
be displayed in the terminal.
As seen below, running the command mentioned earlier to run TensorBoard will
display the line graph for the loss, num_correct_predictions, and accuracy.
Hyperparameter tuning Visualization:
tb.add_hparams() allows us to log the hyperparameters of a run together with its
evaluation metrics, to keep track of the training configuration. It takes two dictionaries as
inputs, one for the hyperparameters and another for the evaluation metrics to be
analyzed.
This graph has the combined logs of all the runs so that you can use the
highest accuracy and lowest loss value and trace it back to the corresponding
batch size, learning rate and shuffle configurations.
From the Hyperparameter Graph it is very clear that setting shuffle to
False(0) tends to yield very poor results. Hence setting shuffle to always True(1)
is ideal for training as it adds randomization.
TERMINOLOGIES:
- Batch Size, Iterations & Epochs:
A single iteration (or step) refers to a single update of the model's weights
based on one batch of training data.
Batch size determines the size of the batch during a single weight update.
An Epoch is a complete pass through the entire training dataset.
Ex:
for epoch in range(<number of epochs>):
    # iterate over multiple batches/iterations (say I), till the entire dataset is covered.
    for (inputs, labels) in <dataloader>:
        # perform training with model on inputs (of size = batch size)
        # loss calculation.
        # optimizer step i.e. weights update. # single iteration.
    print(epoch) # one epoch ended.
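As a concrete illustration of how the three terms relate (the dataset size, batch size and epoch
count below are just example numbers):
import math

dataset_size = 10_000
batch_size = 32
iterations_per_epoch = math.ceil(dataset_size / batch_size) # 313 weight updates per epoch.
num_epochs = 5
total_iterations = num_epochs * iterations_per_epoch # 1565 weight updates in total.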
tqdm: tqdm is a handy tool for showing a progress bar during training, indicating
how much progress has been made. It works on any iterable.
Use {pip install tqdm} or {conda install -c conda-forge tqdm} (on conda) to install
tqdm.
Example code:
from tqdm import tqdm
Note that using print() inside a loop wrapped in “tqdm” does not work well. Use either
print() or tqdm to display information. If not printing info over epochs via
print(), tqdm can also be used (on the outer loop) as:
for epoch in tqdm(range(<number of epochs>)):
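A minimal sketch of wrapping a dataloader with tqdm (train_loader, num_epochs and
train_step are assumed/hypothetical names); set_postfix is used instead of print() to show the
running loss at the end of the bar:
from tqdm import tqdm

for epoch in range(num_epochs):
    progress = tqdm(train_loader, desc=f"Epoch {epoch}")
    for inputs, labels in progress:
        loss = train_step(inputs, labels) # hypothetical single training step returning a float loss.
        progress.set_postfix(loss=f"{loss:.4f}") # shown at the end of the progress bar.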
- Some common Deep NN Architectures:
=> AlexNet (8 layers deep, 1000 categories, input 227x227x3)
Alexnet architecture.
=> EfficientNet: A CNN architecture and scaling method that uniformly scales all
dimensions of depth/width/resolution using a compound coefficient. Unlike
conventional practice that arbitrarily scales these factors, the EfficientNet scaling
method uniformly scales network width (channels), depth (layers), and
resolution (input image size) with a set of fixed scaling coefficients.
The compound scaling method is justified by the intuition that if the input
image is bigger, then the network needs more layers to increase the receptive
field and more channels to capture more fine-grained patterns on the bigger
image.
EfficientNets also transfer well.
- Inductive Bias: The inductive bias of a model is the set of assumptions it makes about the
underlying distribution of data. These biases can influence the model's ability to
learn from a given dataset and can affect the performance of the model on new,
unseen data.
A model with too strong of an inductive bias may fail to capture the
complexity of the underlying data, while a model with too weak of an inductive
bias may overfit the training data.
There are several ways to describe the inductive bias of a model,
including:
- The choice of model architecture.
- The selection of features.
- The type of regularization applied to the model.
For example, a linear regression model has an inductive bias
towards linear relationships between variables, while a decision tree has an
inductive bias towards creating simple, hierarchical partitions of the data.
The inductive bias of a model is a trade-off between its ability to fit the
training data and its ability to generalize to new examples.
Computation problems are classified into different complexity classes based on
the minimum time complexity required to solve the problem:
4) NP-Complete: Problems that are both NP-hard and in NP. Problems
for which the correctness of each solution can be verified quickly, and a brute-
force search algorithm can actually find a solution by trying all possible solutions.
Variance σ² of a variable with observations {x1, x2, ..., xN} and mean μ is
calculated as:
σ² = [(x1 − μ)² + (x2 − μ)² + ... + (xN − μ)²] / N
i.e. σ² = (1/N) * Σ (xi − μ)², with the sum running over i = 1 to N.
Standard deviation is more interpretable as it is in the same units as the
data. It tells you the average "distance" of data points from the mean. It is less
sensitive to outliers.
Variance is often used in statistical calculations but might not provide as
intuitive a sense of the data's spread.
This is because using n instead of n−1 would lead to a biased estimate of the
population variance. Dividing by n−1 corrects this bias, accounting for the fact that
you are using the sample mean (x̄ ) to estimate the population mean (μ).
For population covariance, divide by ‘n’, not (n-1).
Code:
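A minimal NumPy sketch of the population vs. sample versions (x and y are illustrative arrays):
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

pop_var = x.var()                      # variance, dividing by n.
sample_var = x.var(ddof=1)             # sample variance, dividing by n-1 (Bessel's correction).

pop_cov = np.cov(x, y, ddof=0)[0, 1]   # population covariance (divide by n).
sample_cov = np.cov(x, y)[0, 1]        # sample covariance (divide by n-1, the NumPy default).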
Types of correlation:
1) Positive Correlation: When the value of one variable increases, the
other variable also increases.
2) Negative Correlation: When the value of one variable increases, the
other variable decreases.
3) No Correlation: When there is no linear relationship between two
variables.
The measure of correlation is known as the correlation coefficient. The range of
the correlation coefficient is -1 to +1. A scatterplot is used to visualize the
correlation between two numerical variables.
Code:
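A small sketch visualizing two illustrative variables with a scatterplot and computing their
(Pearson) correlation coefficient:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

plt.scatter(x, y)             # visualize the relationship between the two variables.
plt.xlabel("x")
plt.ylabel("y")
plt.show()

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient, always in [-1, +1].
# equivalently: np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())
print(r)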
To normalize this and to get rid of units, we use the correlation coefficient.
Correlation Coefficient = Cov(x,y) / (std(x) * std(y))
The Correlation Coefficient is calculated by dividing the Covariance of x,y by the
Standard deviation of x and y.
Units of Cov(x,y) = (unit of x)*(unit of y)
Units of the standard deviation of x = unit of x.
Units of the standard deviation of y = unit of y.
Intuition:
If the distance (std) from the mean for one variable tends to be broadly
consistent with distance (std) from the mean for the other variable (e.g. people
who are far from the mean for height in either direction tend also to be far from
the mean in the same direction for weight), then we would expect a strong
positive correlation.
If distance from the mean for one variable tends to correspond to a similar
distance from the mean for the second variable in the other direction (e.g. people
who are far above the mean in terms of exercise tend to be far below the mean
in terms of weight), then we would expect a strong negative correlation.
If two variables do not tend to deviate from the mean in any meaningful
pattern (e.g., patterns of shoe size and exercise) then we would expect little or
no correlation.
NOTE:
Correlation is widely used in recommendation systems: by finding customers whose
past ratings correlate strongly with mine, a service can recommend films that like-minded
customers have rated highly but that I have not yet seen.
When the standard deviation for the population is calculated from a smaller
sample, the formula is tweaked slightly: SE = std / √(n-1).
Normal Distribution (Bell curve) & std percentages for the same (68% - 95% - 99.7% rule) .
In pandas, covariance & correlation of data in a dataframe df can be computed
using df.cov() & df.corr() respectively.
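For example (a small illustrative DataFrame):
import pandas as pd

df = pd.DataFrame({"height": [150, 160, 170, 180],
                   "weight": [50, 56, 64, 75]})
print(df.cov())    # pairwise covariance matrix.
print(df.corr())   # pairwise (Pearson) correlation matrix.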
- PCA (Principal Component Analysis): Principal
components are often computed by eigendecomposition of the data covariance
matrix or singular value decomposition (SVD) of the data matrix.
Principal components are constructed in such a manner that the first principal
component accounts for the largest possible variance in the data set.
The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first principal
component and that it accounts for the next highest variance.
This continues until a total of p principal components have been
calculated, equal to the original number of variables.
Visualizing three vectors through a horizontal scaling. The vectors at 0 & 90 degrees (vertical &
horizontal) are eigenvectors, whereas the one at 45 degrees is not.
Eigenvectors are used in dimensionality reduction.
Ex: Given a set of variables that contain information from a dataset, can we use the
information stored in these variables and extract a smaller set of variables (features) to train a
model and do the prediction while ensuring that most of the information contained in the original
variables is retained/maintained. This will result in simpler and computationally efficient models.
This is where eigenvalues and eigenvectors come into the picture.
Uses of PCA:
1) It is used to find interrelations between variables in the data i.e. identifying
patterns.
2) It is used to interpret and visualize data in a simpler way using
“Dimensionality Reduction” i.e. the number of variables is decreasing which
makes further analysis simpler.
3) Noise reduction: PCA can be used to reduce the noise in a dataset by
identifying and removing the principal components that correspond to the noisy
parts of the data.
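In practice, the procedure described in this section can be run directly with scikit-learn; a
minimal sketch (the data matrix X is randomly generated here just for illustration):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)             # 100 samples, 5 variables (illustrative).
pca = PCA(n_components=2)              # keep the 2 components carrying the most variance.
X_reduced = pca.fit_transform(X)       # shape: (100, 2).
print(pca.explained_variance_ratio_)   # fraction of variance explained by each component.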
Computing PCA:
Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables. These combinations are done in
such a way that the new variables (i.e., principal components) are uncorrelated
and most of the information within the initial variables is squeezed or
compressed into the first components - An important thing to realize here is that
the principal components are less interpretable and don’t have any real meaning
since they are constructed as linear combinations of the initial variables.
Geometrically speaking, principal components represent the directions of
the data that explain a maximal amount of variance, that is to say, the lines that
capture most information of the data. The relationship between variance and
information here, is that, the larger the variance carried by a line, the larger the
dispersion of the data points along it, and the larger the dispersion along a line,
the more information it has.
Eigenvalues and eigenvectors always come in pairs: every eigenvector has an eigenvalue. And
their number is equal to the number of dimensions of the data. For example, for a
3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors
with 3 corresponding eigenvalues.
The eigenvectors of the Covariance matrix are actually the directions of
the axes where there is the most variance (most information) and that we call
Principal Components. Eigenvalues are simply the coefficients attached to
eigenvectors, which give the amount (magnitude) of variance carried in each
Principal Component. By ranking your eigenvectors in order of their eigenvalues,
highest to lowest, you get the principal components in order of significance.
Computation:
Solve the characteristic equation to find the eigenvalues of the covariance
matrix C. The characteristic equation is given by:
det(C − λI) = 0
where λ is an eigenvalue and I is the identity matrix (diagonals=1, rest=0).
Solve for λ to get the eigenvalues (det() is determinant of matrix - single value).
Determinant of a matrix.
After obtaining the eigenvalues, you can find the corresponding eigenvectors by
solving the system of linear equations:
(C − λiI)vi = 0
Here, vi is the eigenvector corresponding to the ith eigenvalue.
Illustration:
Let's consider a 3x3 covariance matrix C for a dataset with three
variables X1, X2, X3:
(i) Eigenvalues: The characteristic equation is det(C − λI) = 0. So, solve for λ:
(ii) Eigenvectors: For each eigenvalue λi, solve the system of equations (C −
λiI)vi = 0 to find the corresponding eigenvector vi.
[C]3x3 [v]3x1 = [0]3x1, where [v]3x1 is the eigenvector - note that 3 is the number of variables in the dataset.
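A NumPy sketch of this computation (the 3-variable data matrix is randomly generated just for
illustration):
import numpy as np

X = np.random.rand(200, 3)             # 200 observations of 3 variables.
C = np.cov(X, rowvar=False)            # 3x3 covariance matrix (columns = variables).

eigvals, eigvecs = np.linalg.eigh(C)   # eigh: eigendecomposition for symmetric matrices like C.
order = np.argsort(eigvals)[::-1]      # rank eigenvalues from highest to lowest.
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Columns of eigvecs are the principal-component directions;
# eigvals give the variance carried along each of them.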
- Linear Interpolation:
1D points:
Value = Value at Lower Bound + (Fractional Distance * Difference in Values)
Value at Lower Bound: The known value at the lower bound (point with a lower coordinate
in 1D).
Fractional Distance: The distance between the target point (at Value) and the lower
bound point, divided by the total distance between the two lower(start) & upper(end) points.
Difference in Values: The difference between the values at the upper and lower bounds.
In above example:
More weight is given to the nearest value(See 1/3 and 2/3 in the above figure).
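A direct translation of the 1D formula into code (the numbers are illustrative):
def lerp_1d(x, x0, v0, x1, v1):
    # Linearly interpolate the value at x, given value v0 at x0 and value v1 at x1.
    fractional_distance = (x - x0) / (x1 - x0)
    return v0 + fractional_distance * (v1 - v0)

# Value one third of the way from x0=0 (value 10) to x1=3 (value 40):
print(lerp_1d(1.0, 0.0, 10.0, 3.0, 40.0))   # 20.0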
For 2D (e.g. images), we have to perform this operation twice, once along rows
and then along columns; that's why it is known as bilinear interpolation.
A geometric visualization of bilinear interpolation. The product of the value at the desired point
(black) and the entire area is equal to the sum of the products of the value at each corner and
the partial area diagonally opposite the corner (corresponding colours).
2D points: If the two known points are given by the coordinates (x0, y0)
and (x1, y1), the linear interpolant is the straight line between these points. For a
value x in the interval (x0, x1), the value y along the straight line is given from
the equation of slopes:
(y − y0) / (x − x0) = (y1 − y0) / (x1 − x0), i.e. y = y0 + (x − x0) * (y1 − y0) / (x1 − x0).
- Linear Regression vs Logistic Regression:
- Activation Functions:
Sigmoid: When (multiple) output classes are not mutually exclusive
(particular input data can contain all, some or none of the output classes i.e.
output probabilities are independent of each other), then use a sigmoid. The
sigmoid will allow you to have a high probability for all of your classes, some of
them, or none of them. Sigmoid activation is applied to each output node independently (one sigmoid per output).
In short: If your model’s output classes are NOT mutually exclusive and you
can choose many of them at the same time, use a sigmoid function on the
network’s raw outputs.
The main reason why we use sigmoid function is because it exists between 0 to
1. Therefore, it is especially used for models where we have to predict the
probability as an output. Used in situations, where the sum of probabilities of
classes/labels for an input data need not sum to 1. Also used for Binary
classification (if {0-1}value is greater than a threshold, then class1; else class2).
Sigmoid can also be used where outputs are continuous, whereas Softmax is
used where outputs are categorical.
If we want to have a classifier to solve a problem with more than one right
answer (i.e. outputs are NOT mutually exclusive), the Sigmoid Function is the
right choice, applied to each raw output independently.
Ex: input image might contain dog, cat, horse, or none of them.
Mathematical formula:
s(x) = 1 / (1 + e^(-x))
The function maps any input value to a value between 0 and 1. The
range of the function is (0,1), and the domain is (-infinity,+infinity).
Softmax: The softmax function converts a vector of raw scores into a
probability distribution proportional to the exponentials of the input numbers. After
applying Softmax, each element will be in the range of 0 to 1, and the elements
will add up to 1 (outputs ARE mutually exclusive).
Ex: input image of a single digit can only be one of {0 to 9}.
Hence, Unlike other activations, softmax is performed on top (i.e. taking
into consideration all the output values) of an array of values.
Equation:
softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
Code:
import numpy as np

def softmax(x):
    # “np.sum(np.exp(x))” sums the exponentials of all the raw output values.
    return np.exp(x) / np.sum(np.exp(x))

where x is the vector of outputs from the NN.
It returns a vector, same size as “x”, that contains probabilities for each
element in x.
Here, the Z represents the values from the neurons of the output layer. The exponential acts as
the nonlinear function. Later these values are divided by the sum of exponential values in order
to normalize and then convert them into probabilities.
“j” ranges over the neurons in the output layer (i.e. over the entire length of the “z” input
vector).
Ex:
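For instance, with illustrative raw outputs z = [2.0, 1.0, 0.1]:
softmax([2.0, 1.0, 0.1])
# e^2.0 = 7.389, e^1.0 = 2.718, e^0.1 = 1.105, and their sum = 11.212,
# so the result is approximately [0.659, 0.242, 0.099].
# The outputs sum to 1, and the largest raw score gets the largest probability.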
Tanh: tanh is also like logistic sigmoid but better. The range of the tanh
function is from (-1 to 1). tanh is also sigmoidal (s - shaped). The advantage is
that the negative inputs will be mapped strongly to negative and the zero inputs
will be mapped near zero in the tanh graph (Ex: can be used in prediction of
bounding boxes (say, relative to object center - start point will be negative value)
in object detection; where predicted values can be negative too).
Equation:
tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Properties of tanh:
(1) It is an odd function: tanh(−x) = −tanh(x), so negative inputs give negative outputs.
(2) It is not periodic and is monotonically increasing.
(3) It is symmetric about the origin (zero-centered).
(4) Its domain is the set of all real values.
(5) Its range is (−1, 1).
ReLU (Rectified Linear Unit): the ReLU is half rectified (from bottom). f(z)
is zero when z is less than zero and f(z) is equal to z when z is above or equal to
zero.
Equation: f(x) = max(0, x) returns the larger of (0, x).
Leaky ReLU: With a Leaky ReLU (LReLU), you won’t face the “dead ReLU”
(or “dying ReLU”) problem, which happens when a ReLU’s input is always
below 0 - this completely blocks learning through that ReLU because its gradient is 0 in
the negative part. So:
ReLU: The derivative of the ReLU is 1 in the positive part, and 0 in the
negative part.
LReLU: The derivative of the LReLU is 1 in the positive part, and is a small
fraction in the negative part.
Now, think about the chain rule in the backward pass. If the derivative
(slope) of the ReLU is 0, absolutely no learning is performed on the layers
below the dead ReLU, because 0 will be multiplied into the accumulated gradient
for the weight update. Thus, you can have dead neurons. This problem doesn’t
happen with LReLU or ELU for example, they will always have a little slope to
allow the gradients to flow on.
Equation:
f(x) = max(0.01*x , x)
This function returns x if it receives any positive input, but for any negative value
of x, it returns a really small value which is 0.01 times x. Thus it gives an output
for negative values as well.
- GELU: The GELU (Gaussian Error Linear Unit) activation function is a non-
linear activation function that weights the input by its probability under a
Gaussian/Normal distribution.
Mathematical Expression:
GELU(x) = (x/2) * (1 + erf(x/√2)) # where erf denotes the error function.
≈ 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715x^3))) # tanh-based approximation.
The GELU function is based on the cumulative distribution function of a
Gaussian (normal) distribution. It smoothly approximates the ReLU function while
being differentiable everywhere.
GELUs are used in GPT-3, BERT, and most other Transformers. If you
combine the effect of ReLU, zone out (maintain previous value - a method for
regularizing RNNs), and dropout, you get GELU.
Activation functions like ReLU, ELU and PReLU have enabled faster and better
convergence of Neural Networks than sigmoids.
Also, Dropout regularizes the model by randomly multiplying a few activations by
0.
Both of the above methods together decide a neuron’s output. Yet, the two work
independently from each other. GELU aims to combine them.
NOTE: Zero centered activation functions ensure that the mean
activation value is around zero, hence can easily map the output values as
strongly negative, neutral, or strongly positive. Ex: Tanh.
Sigmoid, ReLU, GELU are not zero centered functions.
Properties of GELU:
Smoothness: GELU is smooth and continuous, which makes it suitable for
gradient-based optimization algorithms like backpropagation.
Range: GELU is unbounded above and only slightly negative below (its minimum
is about −0.17); unlike sigmoid it is not squashed into [0, 1], and it has a wide range of
non-linearity around 0.
Saturation: GELU does not suffer from the vanishing gradient problem as
much as sigmoid, especially in deep networks.
Compared to ReLU: GELU is smooth and approximately zero-
centered (i.e. its expected value is close to zero when applied to a large number
of inputs), addressing the "dying ReLU" (output is 0 when input <= 0) problem
where neurons can become inactive during training: GELU has a non-zero
gradient for small negative inputs, which allows the network to keep learning in
this region. It is often preferred for deeper networks.
Compared to Sigmoid: GELU provides a wider range of non-linearity and
typically converges faster during training.
Compared to Tanh: GELU is similar to tanh in terms of smoothness
and saturation characteristics but has a different shape and can outperform tanh
in certain scenarios.
NOTE: Logits in DL can also be interpreted as the raw NN outputs; that are
unnormalized, before being fed to an activation function such as sigmoid or
softmax.
NOTE:
Probability vs Odds of an outcome: The probability that an event will
occur is the fraction of times you expect to see that event in many trials.
Probabilities always range between 0 and 1.
The odds of an outcome are the ratio of the probability that the
outcome occurs to the probability that the outcome does not occur i.e. { p /
(1 - p) }.
Probabilities between 0 and 0.5 equal odds less than 1.0. A probability of
0.5 is the same as odds of 1.0. Think of it this way: The probability of flipping a
coin to heads is 50%. The odds are “fifty-fifty,” which equals 1.0. As the
probability goes up from 0.5 to 1.0, the odds increase from 1.0 to approach
infinity. For example, if the probability is 0.75, then the odds are 75:25, three to
one, or 3.0.
If the odds are high (million to one), the probability is almost 1.0. If the
odds are tiny (one to a million), the probability is tiny, almost zero.
3) Calculating probability considers all potential outcomes of an event, while
calculating odds involves comparing the number of desired outcomes against the
number of possible unwanted outcomes.
The more parameters (capacity) a model has, the more
likely overfitting will be, since the model can use a greater number of parameters
to memorize unessential aspects of the input.
Depth (number of layers) allows a model to deal with hierarchical information
(increased complexity) when we need to understand the context in order to say
something about some input.
For ex: In regard to computer vision, a shallower network could identify a
person’s shape in a photo, whereas a deeper network could identify the person,
the face on their top half, and the mouth within the face.
Adding depth to a model generally makes training harder to converge (due
to the vanishing / exploding gradients problem, later mitigated by ResNet
architectures via residual (skip) connections).
- Loss Functions: A loss function that weights some samples' errors more heavily makes
training concentrate on improving the outputs for the highly weighted samples instead of
changes to some other sample's output that had a smaller loss.
For example: The square difference ((trueValue - predictedValue)^2)
penalizes wildly wrong results more than the absolute difference((trueValue -
predictedValue)) does. Often, having more slightly wrong results is better than
having a few wildly wrong ones, and the squared difference helps prioritize those
as desired.
In pytorch, we can get the scalar loss value of a loss object for
plotting in a graph.
Ex:
lossFn = nn.MSELoss()
loss = lossFn(predictions, labels)
loss.backward()
lossValue = loss.item() # get the scalar loss value.
lossValuesList.append(lossValue) # plot lossValuesList in graph later.
(a) Binary Cross Entropy Loss:
Corrected probability is the probability that a particular observation belongs to its original class.
As shown in the above image, ID6 originally belongs to class 1 hence its predicted probability
and corrected probability is the same i.e 0.94. On the other hand, the observation ID8 is from
class 0. In this case, the predicted probability, i.e. the chance that ID8 belongs to class 1, is 0.56,
whereas the corrected probability, the chance that ID8 belongs to class 0, is (1 −
predicted_probability) = 0.44.
The reason behind using the log value is, the log value offers less penalty for
small differences between predicted probability and corrected probability. When
the difference is large the penalty will be higher.
Since all the corrected probabilities lie between 0 and 1, all the log values are
negative. In order to compensate for this negative value, we will use a negative
average of the values.
Binary cross-entropy formula: BCE = −(1/N) * Σ [ yi * log(pi) + (1 − yi) * log(1 − pi) ]
Here, pi is the probability of class 1, and (1 − pi) is the probability of class 0. When the observation
belongs to class 1, the first part of the formula becomes active and the second part vanishes,
and vice versa when the observation's actual class is 0. This is how we calculate the Binary
cross-entropy.
(b) Cross Entropy Loss (also called logarithmic loss, log loss or logistic loss):
The purpose of the Cross-Entropy is to take the output
probabilities (P) and measure the distance from the truth values.
For the example above, the desired output is [1,0,0,0] for the class dog but the
model outputs [0.775, 0.116, 0.039, 0.070] . The objective is to make the model
output be as close as possible to the desired output (truth values).
ti represents the true class label for class ‘i’ (1 if the true class is ‘i’, 0 otherwise).
Computing Cross Entropy Loss for above example.
Ex: if the GT is 1 & the predicted probability is 0.8, then CE_Loss = -(1 *
ln(0.8)) = -(1 * -0.223) = 0.223. (Since 0.8 is close to the GT of 1, the loss is small (0.223).)
If the GT is 1 & the predicted probability is 0.2, then CE_Loss = -(1 * ln(0.2)) = -(1 * -
1.609) = 1.609. (Since 0.2 is far from the GT of 1, the loss is larger (1.609).)
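In PyTorch this is typically done with nn.CrossEntropyLoss, which expects the raw logits (it
applies log-softmax internally) together with integer class labels; a minimal sketch with made-up
numbers:
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, 0.1, 0.3]])   # raw model outputs for one sample, 4 classes.
target = torch.tensor([0])                      # ground-truth class index (e.g. “dog”).

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)                # scalar tensor.
print(loss.item())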
Mean squared error (MSE) loss is calculated by averaging the squared
difference between the true value y (GT) and the predicted value ŷ:
MSE = (1/N) * Σ (yi − ŷi)²
Mean absolute error (L1) loss is similar to MSE, but uses the absolute difference between GT & predicted
value:
MAE = (1/N) * Σ |yi − ŷi|
Since the differences are not squared, large errors are not amplified, making this loss more robust to
outliers (data points that have huge errors).
- Metrics:
Loss functions are used during the training phase, whereas metrics are used
during the validation & testing phases. It's important to choose loss
functions and evaluation metrics that align with the goals of your machine
learning task.
Commonly used metrics: Accuracy, confusion matrix, log-loss (CrossEntropy
loss), and AUC-ROC (Area Under Curve - Receiver Operator Characteristics),
etc.
Example Code:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

class_names = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
cm = confusion_matrix(labels, predictions) # labels & predictions hold integer class indices.
ConfusionMatrixDisplay(confusion_matrix=cm,
    display_labels=class_names).plot() # Confusion Matrix visualization.
plt.show()
- Backpropagation:
This is the reverse step of the forward pass: We start with the
loss value obtained in feedforward propagation and update the weights “w” (in y
= w*x + b, where y = output, x = input, b = bias) of the network in such a way that
the loss value is minimized as much as possible, in the following way:
(1) Compute the (original) loss value obtained in feedforward propagation
(with original weights w0).
(2) Change each weight within the neural network by a small amount – one at
a time & compute the new loss.
(3) Measure the change in loss (δL i.e. difference in loss before & after weight
change) when the weight value is changed (δW) i.e. gradient (δL / δW)
(gradients can be positive or negative, thus will result in increase or decrease of
weights during updation).
(4) Update the (original) weight (& bias) (Gradient Descent) by ( -⍺ * (δL / δW)
) (where ⍺ (alpha) is a positive value and is a hyperparameter known as the learning
rate).
Formula: w0 = w0 - ( ⍺ * (δL / δW) )
“w” is the parameter, e.g., a weight in the neural network, and “L” is the
objective, e.g., the loss function. The update moves “w” in the direction that
minimizes the loss. The direction is provided by the differentiation (δL / δW),
but how much you should move is controlled by the learning rate ⍺.
Note that the amount of update made to a particular weight is proportional to the
amount of loss that is reduced by changing it by a small amount.
Intuitively, if changing a weight reduces the loss by a large value, then we
can update the weight by a large amount. However, if the loss reduction is
small by changing the weight, then we update it only by a small amount.
i.e.
(1) compute loss (L_1) with current weights
(2) modify weights (add by a constant C) & recompute new loss (L_2)
(3) compute gradient = (L_2 - L_1) / C.
(4) update weights according to gradient:
updated_w -= gradient * learning_rate.
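A toy numerical version of steps (1)-(4) for a single weight (the quadratic loss and all constants
are purely illustrative; in practice frameworks compute analytic gradients via autograd instead):
def loss_fn(w):
    return (w - 3.0) ** 2              # toy loss, minimised at w = 3.

w = 0.0                                # initial weight.
learning_rate = 0.1
C = 1e-4                               # small perturbation for the finite difference.

for _ in range(100):
    L1 = loss_fn(w)                    # (1) loss with the current weight.
    L2 = loss_fn(w + C)                # (2) loss after nudging the weight by C.
    gradient = (L2 - L1) / C           # (3) approximate dL/dw.
    w -= learning_rate * gradient      # (4) gradient-descent update.

print(w)                               # close to 3.0.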
Chain rule: Computing grads for each weight separately for a huge
network can be computationally expensive. Hence we can use chain rule in
backpropagation:
Performing a chain of differentiations to fetch the differentiation of our
interest i.e. computing the gradient of weights at a layer, & then using that gradient to
compute the gradient of weights in the previous layer, thus avoiding recomputation of
gradients.
Bootstrapping for Reinforcement Learning: In reinforcement learning,
bootstrapping techniques involve estimating future rewards or values by
iteratively updating and refining these estimates based on current observations
and actions.
- Ensemble Learning:
Ensemble learning involves combining multiple individual models to
produce a stronger, more robust model that typically performs better than any
single constituent model.
Bucket of models: A "bucket of models" is an ensemble technique in
which a model selection algorithm is used to choose the best model for each
problem. When tested with only one problem, a bucket of models can produce no
better results than the best model in the set, but when evaluated across many
problems, it will typically produce much better results, on average, than any
model in the set.
Gating: It involves training another model to decide which of the
models in the bucket (i.e. list of models) is best suited to solve the problem.
Often, a perceptron is used for the gating model. It can be used to pick the "best"
model, or it can be used to give a linear weight to the predictions from each
model in the bucket.
- Learning Rate Scheduling: Training/validation metrics can be monitored in
order to determine how the learning rate should update. Regularly, a model is
evaluated with a validation dataset once per epoch.
There are multiple ways of making learning rate adaptive. At the beginning of
training, you may prefer a larger learning rate so you improve the network
coarsely to speed up the progress.
Different Learning Rate Schedulers. Note that schedulers like Cyclic & Cosine
have the ability to again increase the LR, after it has been decreased.
In a very complex neural network model, you may also prefer to gradually
increase the learning rate at the beginning because you need the network to
explore the different dimensions of prediction. At the end of training, however,
you always want to have the learning rate smaller. Since at that time, you are
about to get the best performance from the model and it is easy to overshoot if
the learning rate is large.
Therefore, the simplest and perhaps most used adaptation of the learning
rate during training are techniques that reduce the learning rate over time.
There are many learning rate schedulers provided by PyTorch in the
torch.optim.lr_scheduler submodule. All the schedulers need the optimizer to
update as the first argument. Depending on the scheduler, you may need to
provide more arguments to set up one.
Note that even though you initially set a learning rate in the optimizer, the
scheduler ultimately determines the learning rate used during optimization. This
allows you to implement various learning rate decay schedules and fine-tune the
training process.
Different LRS:
In a one-cycle learning rate policy, the learning rate is adjusted over the course
of training in a cyclical pattern. The policy typically consists of two phases:
Warm-up Phase (Increasing Learning Rate): The learning rate starts at a low value and is
gradually increased, in a linear or geometric manner, up to a maximum value (max_lr).
Annealing Phase (Decreasing Learning Rate): After the warm-up phase, the
learning rate is gradually decreased. This phase may also be implemented in a
linear or geometric manner. During this phase, the learning rate becomes smaller
and allows the model to fine-tune and generalize better.
OneCycleLR() anneals the learning rate from an initial learning rate to some
maximum learning rate (max_lr) and then from that maximum learning rate to
some minimum learning rate much lower than the initial learning rate.
In PyTorch, you can implement a one-cycle learning rate policy using a learning
rate scheduler such as torch.optim.lr_scheduler.OneCycleLR(<optimizer>,
max_lr=, total_steps=None, epochs=None, steps_per_epoch=None,
pct_start=0.3, anneal_strategy='cos', cycle_momentum=True,
base_momentum=0.85, max_momentum=0.95, div_factor=25.0,
final_div_factor=10000.0, …).
Note also that the total number of steps (iterations) in the cycle can be
determined in one of two ways (listed in order of precedence):
A value for total_steps is explicitly provided.
A number of epochs (epochs) and a number of steps per epoch
(steps_per_epoch) are provided. In this case, the number of total steps
(iterations) is inferred by total_steps = epochs * steps_per_epoch.
You must either provide a value for total_steps or provide a value for both
epochs and steps_per_epoch.
(steps_per_epoch = len(<train-set dataloader>))
The One cycle learning rate policy changes the learning rate after every batch.
step should be called after a batch has been used for training (i.e. after each
iteration).
“pct_start” is the percentage of the cycle (in number of steps) spent increasing
the learning rate (default = 0.3).
“div_factor” determines the initial learning rate via initial_lr = max_lr/div_factor
(default = 25).
“final_div_factor” determines the minimum learning rate via min_lr =
initial_lr/final_div_factor (default = 1e4).
To be exact, the learning rate will increase from an initial lr value = (max_lr /
div_factor) to max_lr in the first (pct_start * total_steps) steps (iterations), and
then decrease smoothly to a (final) minimum lr value of (initial_lr / final_div_factor).
Ex Code:
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.01,
    steps_per_epoch=len(data_loader), epochs=10)
for epoch in range(10):
    for batch in data_loader:
        train_batch(...)
        optimizer.step()
        scheduler.step() # called every step (iteration).
- Lambda LR:
Sets the learning rate of each parameter group (of optimizer) to the initial lr times
a given function (lr_lambda). When last_epoch=-1, set initial lr as lr.
Code:
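A minimal sketch that could produce the lrs list plotted below (the SGD optimizer and the
0.95-per-epoch multiplicative lambda are illustrative choices):
import torch
import matplotlib.pyplot as plt

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
    lr_lambda=lambda epoch: 0.95 ** epoch)   # lr = initial_lr * 0.95^epoch.

lrs = []
for epoch in range(10):
    optimizer.step()                              # the training step(s) would go here.
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()                              # update the learning rate once per epoch.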
plt.plot(range(10),lrs)
Output: a line plot of the learning rate over the 10 epochs.
- Linear LR:
import torch
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
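A minimal LinearLR sketch (available in newer PyTorch versions; the factors below are
illustrative). It ramps the learning rate linearly from start_factor * lr to end_factor * lr over
total_iters scheduler steps:
model = torch.nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.LinearLR(optimizer,
    start_factor=0.5,   # begin at 0.05 (= 0.5 * 0.1).
    end_factor=1.0,     # finish at 0.1.
    total_iters=10)     # reach end_factor after 10 scheduler steps.

for epoch in range(10):
    optimizer.step()
    scheduler.step()
    # print(optimizer.param_groups[0]["lr"])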
- Exponential LR:
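ExponentialLR multiplies the learning rate by a fixed gamma every time step() is called; a
minimal sketch (gamma=0.9 is an illustrative value):
model = torch.nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
    optimizer.step()
    scheduler.step()   # lr becomes 0.1 * 0.9^(epoch + 1).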
- Cyclic LR (triangular2):
Sets the learning rate of each parameter group according to cyclical learning rate
policy (CLR). The policy cycles the learning rate between two boundaries with a
constant frequency, as detailed in the paper Cyclical Learning Rates for Training
Neural Networks. The distance between the two boundaries can be scaled on a
per-iteration or per-cycle basis.
The learning rate values change in a cycle from more minor to higher and
vice versa. This method helps the model get out of the local minimum or a saddle
point while not skipping the global minimum.
Code:
import matplotlib.pyplot as plt

model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001,
    max_lr=0.1, step_size_up=5, mode="triangular2")
lrs = []
for i in range(100):
    optimizer.step() # optimizer step.
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step() # update learning rate - done per step (or batch).
plt.plot(lrs)
Code:
import torch
from torch.nn import Parameter

# learning_rate, base_lr & max_lr are assumed to be defined.
model = [Parameter(torch.randn(2, 2, requires_grad=True))] # a plain list of parameters.
optimizer = torch.optim.AdamW(model, lr=learning_rate,
    weight_decay=0.01, amsgrad=False)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr, max_lr,
    step_size_up=2000, step_size_down=None, mode='triangular',
    gamma=1.0, cycle_momentum=False) # cycle_momentum must be False for optimizers
                                     # without a 'momentum' option (e.g. AdamW).
# training loop (forward pass, loss, backward, optimizer.step()) goes here, then:
scheduler.step() # update learning rate - called once per batch (iteration).
Parameters:
base_lr: The initial learning rate, which is the lower boundary of the
cycle.
mode: There are different techniques in which the learning rate can vary
between the two boundaries:
- ‘triangular’: In this method, we start training at the base learning
rate and then increase it until the maximum learning rate is reached. After
that, we decrease the learning rate back to the base value. Increasing and
decreasing the learning rate from min to max and back take half a cycle
each.
- ‘triangular2’: In this method, the maximal learning rate
threshold is cut in half every cycle. Thus, you can avoid getting stuck in the
local minima/saddle points while decreasing the learning rate.
gamma: The constant variable in the ‘exp_range’ scaling function - a
multiplicative factor by which the learning rate is decayed. For instance, if the
learning rate is 1000 and gamma is 0.5, the new learning rate will be 1000 x 0.5
= 500.
The gamma value should be less than 1 to reduce the learning rate.
- CosineAnnealingWarmRestarts:
Notice the gradual increase in LR, after LR reaches floor (least) value.
Notice the instant ‘reset’ of LR to original value, after LR reaches floor value.
The CosineAnnealingWarmRestarts Scheduler requires some extra steps to
function properly.
torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0,
T_mult=1, eta_min=0, last_epoch=-1, verbose=False)
Set the learning rate of each parameter group using a cosine annealing
schedule, where ηmax is set to the initial lr, Tcur is the number of epochs since the
last restart and Ti is the number of epochs between two warm restarts in SGDR:
ηt = ηmin + (1/2) * (ηmax − ηmin) * (1 + cos(π * Tcur / Ti))
When Tcur = Ti, set ηt = ηmin. When Tcur = 0 after a restart, set ηt = ηmax.
Ex: If T_0 = 3, T_mult = 1 and eta_min = 0.0001 & initial learning rate =
0.001; the scheduler will start with an initial learning rate of 0.001 and
reduce it to 0.0001 in every 3 epochs. Then, it'll start again with a learning rate of
0.001 and decrease it to 0.0001 in 3 epochs.
In order to properly change Learning Rate for longer training, you should
ideally pass the epoch number while invoking the step() function. Like so:
Code:
optimizer = ...
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, ...)
iters = len(dataloader)
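The loop that goes with the snippet above (following the pattern in the PyTorch docs; the
fractional epoch passed to step() is what keeps the restarts aligned with iterations):
for epoch in range(20):
    for i, (inputs, labels) in enumerate(dataloader):
        # forward pass, loss computation, loss.backward() ...
        optimizer.step()
        scheduler.step(epoch + i / iters)   # fractional epoch number.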
NOTE: If passing the epoch number is not done, this usually leads to a
more erratic change of learning rate, and not gradual and smooth as expected.
- Indexing with None (adds a dimension, like unsqueeze). Given a tensor t1 of shape [3, 100, 100]:
t1[None] is similar to performing t1.unsqueeze(0) (along the 0th dimension). #
shape = [1, 3, 100, 100].
t1[:, None] is similar to performing t1.unsqueeze(1) (along 1st dimension). #
shape = [3, 1, 100, 100].
t1[..., None] will insert a dimension at the last position. # shape = [3, 100, 100, 1].
t1[..., None, :] will insert one just before the last dimension. # shape = [3, 100, 1, 100].
- Optimizers: Optimizers are functions that control how the weights &
learning rate are changed, to decrease the associated error i.e. optimize
accuracy. Torch has an ‘optim’ module that provides a list of optimizers:
import torch.optim as optim
Every optimizer constructor takes a list of (model) parameters (aka
PyTorch tensors, typically with requires_grad set to True) as the first input. All
parameters passed to the optimizer are retained inside the optimizer object so
the optimizer can update their values and access their grad attribute.
Each optimizer exposes two methods: zero_grad() and step().
zero_grad() zeroes the grad attribute of all the parameters passed to the
optimizer upon construction, to clear out the (previous) gradients of all
parameters that the optimizer is tracking, as the gradients are accumulated (This
is useful when we want to accumulate gradients across multiple batches, but it
can lead to incorrect gradient computations when we only want to compute the
gradients for a single batch).
step() updates the value of those parameters according to the optimization
strategy implemented by the specific optimizer.
Ex: Adam, SGD, RMSProp, etc.
Adam:
- Adam (Adaptive Moment Estimation) dynamically adjusts the learning rate
for each parameter during training (SGD used fixed LR). It computes a separate
adaptive learning rate for each parameter, which can be particularly useful when
dealing with sparse or noisy gradients.
- Adam incorporates momentum-like behavior through the first-order
moment (mean) of the gradients. It helps smoothen the optimization process and
escape local minima i.e. convergence.
- It includes a bias correction mechanism to counteract the initialization bias
in the first few iterations, especially when the moving averages are small at the
beginning of training (SGD does not have this).
- Adam often converges faster and requires less tuning of hyperparameters
like learning rate, compared to vanilla SGD. However, the choice of optimizer
may depend on the specific problem and dataset.
- Overfitting: When the model performs well on training data (loss is less),
but loss is comparatively much more (than acceptable difference) in validation
data, the model is said to overfit. It means the model is not able to generalize
well.
If both training & validation loss is high, it means the model is not able to learn at
all (underfitting).
To reduce overfitting, there is a tradeoff to manage: the model should be large enough to
generalize well, but not so large that it overfits. Overfitting can occur when the model has a much
higher number of parameters/neurons than necessary. Hence, one way of
reducing overfitting is to start training with a model that has fewer number of
neurons, then start increasing the neurons, till the model does not overfit, while
still maintaining generalization i.e. increase the parameters size until it fits, and
then scale it down until it stops overfitting.
Another way to reduce overfitting is adding penalization terms to the loss
function, to make it cheaper for the model to behave more smoothly and change
more slowly.
Training set: A set of examples used for learning, that is to fit the
parameters of the model.
Validation set: A set of examples used to tune the hyper-parameters of
a model (manually, or via automated frameworks), for example to choose the
number of hidden units in a neural network.
The evaluation becomes more biased as skill on the validation dataset is
incorporated into the model configuration.
Following points can also be used during training, to analyze & improve
model performance:
(2) Batch size & Shuffling: Usually, a smaller batch size means
better model performance, as more weight updates are performed (batch size of
8-16 is considered as good). Shuffling batches means weights can be avoided
from getting stuck in local minima.
The larger the batch size, the more time it takes to converge and the more
iterations are required to attain a high accuracy.
(3) Scaled inputs/Normalisation: Scaling all inputs/outputs in a
common range of 0 to 1, or -1 to 1, increases the possibility of the model being
able to better fit the input data.
(4) Learning Rate: Usually, a small learning rate results in more
stable learning, as well as the model being able to fit training data. Alternatively,
learning rate can be annealed (LR scheduling) by being large at beginning of
training, & being lowered, when validation losses do not decrease. Can also use
a learning rate scheduler.
(5) Loss Function selection.
(6) Optimizer selection.
Overfitting related:
1) Scenario:
(a) Train a model for E epochs. At the end, loss settles at a small value, say L.
(b) When retrain the model for much larger epochs (say 3*E or 4*E times or
more), loss decreases slowly, and finally settles at value similar to L, as in (a).
Tip:
The model is probably stuck in a local optimum. Try using a larger learning rate if there is little
change in loss over several iterations - use a scheduler that periodically starts
from a larger LR, brings it down gradually, and then again starts from a bigger
one.
TENSORS:
- Named tensors: tensor.align_as(other) rearranges a (named) tensor so that its dimensions
match the order of names in the other tensor, with missing dimensions added
and existing ones permuted to the right order:
# align Img3’s data as per Img1’s names, into Img4.
Img4 = Img3.align_as(Img1)
Img4 = Img3.rename(None) # remove all names (unnamed).
- PyTorch tensors can be converted to NumPy arrays and vice versa very
efficiently using tensor.numpy() & torch.from_numpy().
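For example (for CPU tensors the two objects share the same underlying memory, so modifying
one modifies the other):
import numpy as np
import torch

t = torch.ones(3)
arr = t.numpy()              # tensor -> ndarray (shared memory).
t2 = torch.from_numpy(arr)   # ndarray -> tensor (shared memory).

arr[0] = 5.0
print(t)                     # tensor([5., 1., 1.]) - the change is visible in the tensor too.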
- Saving & loading models: Saving an entire model object with torch.save(model, PATH) relies
on Python's pickle module. The disadvantage of this approach is that the serialized data is bound to
the specific classes and the exact directory structure used when the model is
saved. The reason for this is because pickle does not save the model class itself.
Rather, it saves a path to the file containing the class, which is used during load
time. Because of this, your code can break in various ways when used in other
projects or after refactors.
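A hedged sketch contrasting the two common approaches (the file paths and MyModelClass are
illustrative names):
import torch

# (a) Save/load the entire model object (uses pickle - the approach discussed above):
torch.save(model, "model_full.pth")
model = torch.load("model_full.pth")          # needs the original class definition importable.

# (b) Save/load only the learned parameters (state_dict) - the generally recommended way:
torch.save(model.state_dict(), "model_weights.pth")
model = MyModelClass()                        # re-create the architecture first.
model.load_state_dict(torch.load("model_weights.pth"))
model.eval()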
ALBUMENTATIONS (image augmentation library):
Installation:
pip:
pip install -U albumentations
conda:
conda install -c conda-forge imgaug
conda install -c conda-forge albumentations
Code:
# Albumentations uses the most common and popular RGB image format, so use cv2.cvtColor()
to convert BGR to RGB, where applicable.
# Albumentations works with numpy arrays.
(2)
(a) Define an augmentation pipeline:
import albumentations as A

transform = A.Compose([
    A.RandomCrop(width=256, height=256),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])
# Each augmentation will change the input image with the probability set by the parameter p.
Also, many augmentations have parameters that control the magnitude of changes that will be
applied to an image.
(b) Usage:
transformed = transform(image=image)
# transform will return a dictionary with a single key image. Value at that key will contain an
augmented image.
transformed_image = transformed["image"]
OR
transformed_image = transform(image=image)["image"]
Keypoints (affine-translation) augmentation (link:
https://albumentations.ai/docs/getting_started/keypoints_augmentation/)
Example:
translate_dict = {"x":(-0.38, 0.3), "y":(-0.13, 0.34)}
aff_transforms = A.Compose([A.Affine(translate_percent=translate_dict,
keep_ratio=True)], keypoint_params=A.KeypointParams(format='xy',
remove_invisible=False))
transformed = aff_transforms(image=img, keypoints=kps) # keypoint (kps) values
should be in image dimensions (i.e. image width, height), not normalized (i.e. 0-1).
transformed_image = transformed['image']
transformed_keypoints = transformed['keypoints']
SCRIPTING:
USEFUL FUNCTIONS:
- torch functions that end with an underscore i.e. “_”, perform operations in-
place i.e. they modify the data of the same provided variable tensor.
Note that these in-place functions are methods on the torch.Tensor object, not
attached to the torch module like many other functions (e.g., torch.sin()).
Ex:
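A small sketch (values are illustrative):
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = x.add(10)   # out-of-place: x is unchanged, y is a new tensor.
x.add_(10)      # in-place: x itself becomes tensor([11., 12., 13.]).
x.zero_()       # in-place: x becomes tensor([0., 0., 0.]).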
- tensor.view(*shape): This is similar to numpy.reshape(), & is useful to change
the shape of a tensor, with data being the same.
Calling view() on a tensor returns a new tensor (header) that changes the
number of dimensions and the striding information, without changing the storage.
Ex:
a = torch.arange(1, 17) # a is a 1D tensor containing 16 elements.
a = a.view(4, 4) # a contains the same 16 elements, arranged as 4 rows, 4 cols.
If we are not sure of one field (say columns), but know how many of the other
fields (say rows) we want, we can specify the former field as -1 and PyTorch will
automatically infer it. Note that at most one dimension can be -1, even when the
tensor has more than 2 dimensions.
Ex:
a = a.view(2,-1) # a contains 16 elements (2 rows, 8 cols)
- torch.ones(*size): Returns a tensor filled with the scalar value 1, with the
shape defined by the variable argument size. Size can be a sequence of
integers, or a list or tuple.
Ex:
a = torch.ones(2, 2)
b = torch.ones([2, 2]) # both a & b have the same shape i.e. (2, 2).
- Arithmetic operations:
Ex:
y = x * 10 # multiply elements of tensor x by 10. Same as torch.mul().
y = x.add(20) # add 20 to each element in x.
z = torch.div(x, y) # divide each element of x by the corresponding element of y.
z = torch.sub(x, y) # subtract y from x.
Also see [ torch.matmul(x, y) OR x@y ] for matrix operations.
In PyTorch, when you use the * operator to multiply two 2D tensors A and
B, it performs element-wise multiplication (Hadamard product) rather than matrix
multiplication or dot product.
- Element-wise Multiplication: Using the * operator, as in A * B, will
multiply corresponding elements of the two tensors element-wise. This means
that the result will have the same shape as the input tensors (both inputs should
have the same shape), and each element in the result will be the product of the
corresponding elements in A and B.
- Dot Product: Dot product is the sum of products of values in two same-
sized vectors (scalar output, if inputs are 1D arrays).
If you want to compute the dot product between two vectors, you should use
torch.dot().
Unlike NumPy’s dot, torch.dot() intentionally only supports computing the dot
product of two 1D tensors with the same number of elements. To calculate the
dot product, you can perform the element-wise product and then sum the results
(sum of all elements in the resulting matrix) i.e. C = (A*B).sum().
- Matrix Multiplication: Matrix multiplication is basically a matrix version
of the dot product (the 2nd & 3rd examples in the above image).
If you want to perform matrix multiplication between two tensors, you should use
the torch.mm() function (if the input tensors are not 2D, it will raise an error. It
does not support broadcasting) or the @ operator; or torch.matmul() (matmul() is
more versatile and can handle a wider range of tensor shapes and operations. It
performs matrix multiplication for 2D tensors like torch.mm(), but it can also
handle higher-dimensional tensors and supports broadcasting for tensors of
different shapes).
Inputs of matrix multiplication should have shape: (m x n) & (n x p) - output
will have shape (m x p).
The result of torch.dot() is a scalar. The result of matrix multiplication
is a matrix, whose elements are the dot products of pairs of vectors in each
matrix.
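A small sketch contrasting the three operations (values are illustrative):
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
dot = torch.dot(a, b)    # 1*4 + 2*5 + 3*6 = 32.0 (a scalar).

A = torch.ones(2, 3)
B = torch.ones(3, 4)
elementwise = A * A      # same shape as A; element-wise (Hadamard) product.
matmul = A @ B           # shape (2, 4); same as torch.mm(A, B) / torch.matmul(A, B).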
t1 = torch.tensor([0.5])
element = t1.item() # element = 0.5 - extracts the Python scalar from a single-element tensor.
- torch.nn.functional.one_hot(tensor, num_classes): converts a tensor of (integer) class
indices into one-hot encoded vectors.
Ex:
tens1 = torch.Tensor([0,1,0]) # ‘0’ & ‘1’ both represent different classes.
result = torch.nn.functional.one_hot(tens1.to(torch.long), 2)
# result = [[1,0], [0,1], [1, 0]]
- timeit:
“timeit” provides a method of measuring the execution time of your Python
code snippets.
Ex Code:
import timeit
def test(n):
return sum(range(n))
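The measurement itself (using the string form discussed below) might look like this, with
illustrative values for n and the loop count:
n = 10_000
loop = 1000

result = timeit.timeit("test(n)", globals=globals(), number=loop)
print(result / loop)   # average seconds per call.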
When you pass the code you wish to measure to timeit.timeit() as a string, it
executes the code number times and returns the execution time.
The default value for “number” is 1,000,000. Be aware that running time-
consuming code with the default value can take significant time.
The code is executed in the global namespace by providing globals() as the
globals argument. Without this, the function test and the variable n from the
example would not be recognized.
timeit.timeit() can also accept a callable object. You can specify lambda
expression with no arguments, in which case the globals argument is not
necessary.
result = timeit.timeit(lambda: test(n), number=loop)
print(result / loop)
timeit.timeit() returns the time (in seconds) it took to execute the code “number”
times.
- time: The elapsed time (i.e. time taken for a section of code to execute)
can be estimated using the “time” module:
Ex Code:
import time
# Start timer
start_time = time.time() # returns a float value (current start time in seconds).
# Code to be timed
# ...
# End timer
end_time = time.time()
elapsed_time = end_time - start_time # elapsed time in seconds.
print(f"Elapsed time: {elapsed_time:.3f} s")
- torch.cuda.stream: This API allows for parallel execution of operations (not
threads) simultaneously on GPU, maximizing usage of available GPU resources.
Allows you to manage and schedule operations on a GPU. It represents an
independent sequence of CUDA operations that can be executed concurrently
with other streams, enabling asynchronous execution of operations on the GPU -
enabling overlap of computation and communication.
Primary functions:
(1) Asynchronous Execution: Streams allow you to perform computations
concurrently on the GPU. You can schedule different operations (kernels,
memory copies, etc.) within different streams, enabling overlap of computation
and communication.
(2) Explicit Control: With streams, you have explicit control over the
execution order of operations on the GPU. This can help in overlapping
computation with data transfers or overlapping different computation tasks,
ultimately improving overall performance.
(3) Resource Management: Streams provide a way to manage
resources on the GPU. They allow you to allocate memory and execute
operations within specific contexts, avoiding contention between different parts of
your code that might be using the GPU simultaneously.
(4) Synchronization: Streams can be synchronized explicitly, ensuring that
operations within a particular stream complete before moving on to the
subsequent ones. This synchronization can be useful in scenarios where the
order of execution matters or when dependencies exist between different
computations.
Example Code:
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
# Initialise cuda tensors here. E.g.:
A = torch.rand(1000, 1000, device = 'cuda')
B = torch.rand(1000, 1000, device = 'cuda')
# Wait for the above tensors to initialise.
torch.cuda.synchronize()
with torch.cuda.stream(s1):
C = torch.mm(A, A)
with torch.cuda.stream(s2):
D = torch.mm(B, B)
# Wait for C and D to be computed.
torch.cuda.synchronize()
# Do stuff with C and D.
import threading
import time

def f1():
    time.sleep(6)
    print("f1() complete.")
    return

def f2():
    time.sleep(3)
    print("f2() complete.")
    return

def start():
    # Create threads for f1 and f2.
    thread1 = threading.Thread(target=f1)
    thread2 = threading.Thread(target=f2)
    # Start both threads, then wait for both to finish.
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
    print("Execution completed.")
    return

start()
Output:
f2() complete.
f1() complete.
Execution completed.
Usage Ex:
import pydicom as dicom
dicomfile = dicom.dcmread(<file name>)
video = dicomfile.pixel_array # extracts image/video data from dicom file in
compatible form - requires NumPy.
Code:
import cv2

# Assumes video (an array of frames, e.g. dicomfile.pixel_array from above), kp (keypoints),
# filepath, fps, width & height are already defined.
result = cv2.VideoWriter(filepath, cv2.VideoWriter_fourcc(*'MJPG'), fps,
    (width, height))
for frame_no, frame in enumerate(video):
    # optional - display results (18 keypoints per frame here (in “kp”), for illustration).
    for i in range(18):
        x = int(kp[frame_no, i, 0])
        y = int(kp[frame_no, i, 1])
        cv2.circle(frame, (x, y), 2, (0, 255, 0), 3)
    result.write(frame)
    # cv2.imshow("w1", frame) # optional - display frame.
    # cv2.waitKey()
result.release()
# create a residual network instance with 101 layers & pretrained weights.
resnet = models.resnet101(weights='IMAGENET1K_V1')
resnet
#The resnet variable can be called like a function, taking as input one or more images and
producing an equal number of scores for each of the 1,000 ImageNet classes. Before we can
do that, however, we have to preprocess the input images so they are the right size and so that
their values (colors) sit roughly in the same numerical range. In order to do that, the torchvision
module provides transforms.
from torchvision import transforms
# define a preprocess function that will scale the input image to 256 × 256, crop the image to
224 × 224 around the center, transform it(PIL image) to a tensor, and normalise its RGB
components so that they have defined means and standard deviations. These need to match
what was presented to the network during training, if we want the network to produce
meaningful answers.
# transforms.Normalize(mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5])]) - This will
normalize the image to have a mean of 0 & standard deviation of 1 { formula: value =
(image pixel value - mean) / std } - Note that this does not mean that the output values will
be in the range 0-1. ToTensor() automatically scales values to range (0,1) - if the PIL Image
belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the
numpy.ndarray has dtype = np.uint8, so to convert the values to a range (-1, 1), use 0.5
instead of 127.5 in mean & std.
transforms.ToTensor() also automatically rearranges the ordering of the dimensions of a
cv2 image(initially [H x W x C]), so that the output image has dimensions [C x H x W]. Input
image should have a shape of length 3, even if it is a grayscale image i.e. [H,W,1] - sending
an image of shape [H,W] to transform.resize does not work.
preprocess = transforms.Compose([transforms.Resize(256),
transforms.CenterCrop(224), transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])]) #
this will bring the mean to 0 (note that range won't necessarily be in 0 to 1, but somewhat near it
ex: around -2 to 2, as happens for most images).
# load a test image so that we can preprocess it & feed to our network.
from PIL import Image
# import image - path is relative to where the current jupyter notebook file is saved.
img = Image.open("images/bobby.png")
img # show image in notebook.
# preprocess this image to convert it into the size, crop, etc used for training.
img_t = preprocess(img)
# unsqueeze returns a new tensor with a dimension of size one inserted at the specified position, without changing its contents. Inserting at dim 0 below turns the [C x H x W] image tensor into a [1 x C x H x W] batch containing a single image, which is what the network expects as input.
import torch
batch_t = torch.unsqueeze(img_t, 0)
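The forward pass that produces the “out” scores used below is not shown above; a minimal sketch, using the resnet and batch_t variables defined earlier:
resnet.eval() # put the network in inference mode (affects dropout/batchnorm layers).
out = resnet(batch_t) # run the batch through the network; out has shape [1, 1000].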
# See the list of predicted labels(in file ‘imagenet_classes.txt’), by loading a text file listing the
labels in the same order they were presented to the network during training.
with open ('imagenet_classes.txt') as f:
labels = [line.strip() for line in f.readlines()]
# Determine the index corresponding to the maximum score in the out tensor. ‘index’ is of the
form tensor([x]).
# torch.max() Returns a named tuple (values, indices) where values(_ below) is the maximum
value of each row of the input tensor in the given dimension dim(1 below); and indices is the
index location of each maximum value found (argmax).
_, index = torch.max(out, 1) # see Fig. 1
# Convert scores into percentage (*100) using softmax(which gives probabilities); along
dimension dim. Dim is the dimension in tensor, to be selected as input.
percent = torch.nn.functional.softmax(out, dim=1)[0]*100
# index is a tensor, so need to get the actual value by referencing the first element as index[0].
Output the label & it’s confidence/percentage.
labels[index[0]] , percent[index[0]].item()
# can also sort the output using sort() that also provides the indices of the sorted values in the
original array, so that we can get listing of top ‘N’ scores.
_, indices = torch.sort(out, descending=True)
# get the 2nd, 3rd best prediction & so on. ‘Indices’ is of the form tensor([ [x, y, z,...] ]). In
indices[0][1], 1st index ([0]) refers to the outermost brackets of the tensor([ [x,y,z,...] ]), 2nd
index [1]) to next inner brackets.
labels[indices[0][1]] , percent[indices[0][1]].item()
labels[indices[0][2]] , percent[indices[0][2]].item()
VSCode via Anaconda:
Terminal inside VSCode can also be used to start your own virtual
environment (python -m venv), install required packages inside this venv, etc.
import torch
from torchvision import models
from torchvision import transforms
from PIL import Image
#print("\nStart\n")
resNvar = models.resnet101(weights="IMAGENET1K_V1")
preprocess = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224),
transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229,
0.224, 0.225])])
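The script also needs to load and preprocess an input image to produce inImgBatch; a minimal sketch (the file name is illustrative):
img = Image.open("images/test_image.jpg") # hypothetical path to an input image.
inImgBatch = preprocess(img) # apply the same transforms as above -> [3, 224, 224] tensor.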
in_batch_t = torch.unsqueeze(inImgBatch, 0)
resNvar.eval()
output = resNvar(in_batch_t)
#output = tensor([[...]]), so get the 0th dimension.
percentages = torch.nn.functional.softmax(output, dim=1)[0]
- Real World Data Representation using Tensors:
IMAGES:
imgVar = imageio.imread('<file path & name>') # load image into a NumPy array.
imgVar.shape # get the dimensions (H x W x C). (For torch tensors, Tensor.shape is an alias for Tensor.size().)
imgTrchV = torch.from_numpy(imgVar) # convert to torch tensor.
# imgVar has format (H x W x C). Use permute() to rearrange dimensions. Dimensions are 0 based. Original = (0,1,2). For arranging in (C x H x W), the new order is (2,0,1). permute() does not create a new copy; it returns a view with the new dimension ordering.
out = imgTrchV.permute(2,0,1) # out tensor format is CHW.
- Neural networks exhibit the best training performance when the input data
ranges roughly from 0 to 1, or from -1 to 1 (this is an effect of how their building
blocks are defined). Hence normalization is needed. For image data, one way is
to divide input by 255, another way is to scale input using it’s mean & standard
deviation, so that it has 0 mean & unit std.
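A minimal sketch of both options (imgTrchV is the H x W x C tensor from above; the per-channel statistics here are computed from the image itself, purely for illustration):
img_f = imgTrchV.permute(2, 0, 1).float() # C x H x W, as floats.
img_scaled = img_f / 255.0 # option 1: scale values into the range [0, 1].
mean = img_f.mean(dim=(1, 2), keepdim=True) # option 2: per-channel standardization.
std = img_f.std(dim=(1, 2), keepdim=True)
img_standardized = (img_f - mean) / std # zero mean, unit standard deviation per channel.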
TIME SERIES:
(A) Time series: A dataset having C data fields (columns) can have its rows ordered
according to different points in time. The rows are therefore ordered, i.e. there is a
relationship between rows as connected by points in time. The rows then represent readings
taken for those C data fields (1st dimension) over a period of time, at L intervals (2nd
dimension).
This C x L data can be further batched into higher-level time intervals. For example, if L
represents hourly readings, they can be grouped into daily samples of shape C x L with L = 24
(24 hours in a day), as sketched below.
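A minimal sketch of that daily batching, assuming a 2D tensor of hourly readings with shape (total_hours, C), where total_hours is a multiple of 24:
import torch
C = 3 # e.g. three measured fields per hour (illustrative).
hourly = torch.rand(240, C) # 10 days of hourly readings, shape (240, C).
daily = hourly.view(-1, 24, C).permute(0, 2, 1) # shape (N=10, C, L=24): one sample per day.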
Example: If we were to design a solution to this problem by hand, we
might decide to build our embedding space by choosing to map basic nouns and
adjectives along the axes. We can generate a 2D space where axes map to
nouns—fruit (0.0-0.33), flower (0.33-0.66), and dog (0.66-1.0)—and adjectives—
red (0.0-0.2), orange (0.2-0.4), yellow (0.4-0.6), white (0.6-0.8), and brown (0.8-
1.0). Our goal is to take actual fruit, flowers, and dogs and lay them out in the
embedding. So, we can map apple to a number in the fruit and red quadrant.
Likewise, we can easily map tangerine, lemon, lychee, and kiwi in the respective
ranges. For dogs and color, we can embed redbone near red; uh, fox perhaps for
orange; golden retriever for yellow, poodle for white, & so on, as shown in below
figure.
Embeddings are often generated using neural networks, trying to predict a word
from nearby words (the context) in a sentence. For example: we could start
from one-hot-encoded words and use a (usually rather shallow) neural network to
generate the embedding. Once the embedding was available, we could use it for
downstream tasks.
One interesting aspect of the resulting embeddings is that similar words end up
not only clustered together, but also having consistent spatial relationships with
other words. For example, if we were to take the embedding vector for apple and
begin to add and subtract the vectors for other words, we could begin to perform
analogies like apple - red - sweet + yellow + sour and end up with a vector very
similar to the one for lemon.
More contemporary embedding models—with BERT and GPT-2 making
headlines even in mainstream media—are much more elaborate and are context
sensitive: that is, the mapping of a word in the vocabulary to a vector is not fixed
but depends on the surrounding sentence. On the flip side, even when we deal
with text, improving the pre-learned embeddings while solving the problem at
hand has become a common practice.
(2) Create a variable/instance that holds the model responsible for learning:
Ex: linear_model = nn.Linear(<no of inputs>, <no of outputs>)
(3) Create an optimizer, and pass the model’s parameters to it.
Ex: opt = optim.SGD(linear_model.parameters(), lr=<learning rate>)
(4) Choose a loss function (e.g. nn.MSELoss()).
(5) Run the training loop, providing epochs, model, optimizer, loss function,
training & validation data with labels:
def training_loop(epochsN, model, optimizer, lossFn, trainData, trainLabels,
                  valData, valLabels):
    for epoch in range(1, epochsN + 1):
        # forward + backward + optimize.
        predict = model(trainData) # forward pass.
        loss_train = lossFn(predict, trainLabels) # calculate loss.
        optimizer.zero_grad() # clear the previously accumulated gradients.
        loss_train.backward() # backward pass (accumulate gradients of the loss).
        optimizer.step() # update the model parameters.
Forward pass
Compute loss
Accumulate gradient of loss (Backward pass)
Update model (optimizer step) with accumulated gradient
¹ => When the above enumeration on the DataLoader is called for a given batch of
size “B”, the DataLoader calls the Dataset’s overridden __getitem__() B times,
using an index (that may be random if shuffle=True).
If the entire dataset is too large to fit in memory, we can load the required
data in the __getitem__() function, instead of doing it in the Dataset’s __init__().
NOTE: If shuffle = True, the data is reshuffled into batches after every
epoch (i.e. after every complete (start to end) iteration over all the batches in
dataloader).
PyTorch code:
import torch
import torch.nn as nn
#training.
for epoch in range(n_epochs):
for imgs, labels in train_loader:
batch_size = imgs.shape[0]
outputs = model(imgs.view(batch_size, -1)) # imgs.view() specifies output to
be of 2 dimensions, 1st is batch_size, 2nd is inferred.
loss = loss_fn(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# validation
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64,
shuffle=False)
correct = 0
total = 0
with torch.no_grad():
for imgs, labels in val_loader:
batch_size = imgs.shape[0]
outputs = model(imgs.view(batch_size, -1))
_, predicted = torch.max(outputs, dim=1)
total += labels.shape[0]
correct += int((predicted == labels).sum())
print("Accuracy: %f", correct / total)
- Dataset:
The Dataset retrieves our dataset’s features and labels one data point at a time,
for the DataLoader, which then reshuffles & creates mini batches for use in
training.
- For reading csv files, we can use the pandas (import pandas as pd)
function pd.read_csv(<file name>). This function loads the data into a DataFrame.
Some handy operations on DataFrames:
(1) To get all rows from a dataframe “df” whose column “colX” has the value “col_value1”:
df = df.loc[df["colX"] == "col_value1"]
df.loc[<boolean mask>] selects the rows for which the mask is True (the first argument of loc
selects rows; an optional second argument selects columns). This returns 1 or more rows.
(2) To get the value of column “colX” for the row(s) where column “rowX” equals “row_value”:
df.loc[df["rowX"] == "row_value"]["colX"]
This returns the matching value(s) as a pandas Series.
(3) We can use the read_csv() argument chunksize=<int> to load a huge csv file in
chunks, passing the number of lines to read from the file per chunk. The function then
returns a TextFileReader object for iteration, so whatever operations we need to do after
loading the csv must be done in a loop, once for each loaded chunk, as sketched below.
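A minimal sketch of chunked reading (the file name, chunk size, and process() helper are illustrative):
import pandas as pd
reader = pd.read_csv("huge_file.csv", chunksize=100_000) # returns a TextFileReader.
for chunk in reader: # each chunk is a DataFrame with up to 100,000 rows.
    process(chunk) # placeholder for whatever per-chunk work is needed.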
path = "C://Users//Desktop//sample_folder"
dir_list = os.listdir(path)
# files = [f for f in dir_list if os.path.isfile(path + '/' + f)] # obtaining only the files.
print(dir_list)
os.path.isfile() returns true only if the path name corresponds to a file, &
not a folder. Use not(os.path.isfile()) to check for folders. os.getcwd() returns the
current working directory.
<string>.endswith(<extension str>) can be used to check the extension of
the file name, when string contains the path of the file, with extension.
This function can be used for manual control when stepping through nested folder
structures - for example, when walking training image files that are grouped inside
per-class folders (contained in the supplied directory path), where the folder name is the
label (i.e. dir path -> { label1 | label2 | … folders } -> files in the corresponding folder),
as in the sketch below.
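A minimal sketch of collecting (file path, label) pairs from such a structure (the root path and extensions are illustrative):
import os
root = "dataset/train" # hypothetical root containing one sub-folder per label.
samples = []
for label in os.listdir(root):
    label_dir = os.path.join(root, label)
    if not os.path.isfile(label_dir): # keep only sub-folders (the labels).
        for fname in os.listdir(label_dir):
            if fname.endswith((".jpg", ".png")): # keep only image files.
                samples.append((os.path.join(label_dir, fname), label))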
To load images from disk, see section “Example of operations on images” later in
this doc.
Building NN:
The end result is a model that takes the inputs expected by the first
module specified as an argument of nn.Sequential, passes intermediate outputs
to subsequent modules, and produces the output returned by the last module.
In the code above, the model fans out from 1 input feature to 13
hidden features, passes them through a tanh activation, and linearly combines
the resulting 13 numbers into 1 output feature.
(output excerpt) 2.bias torch.Size([1])
The name of each module (0, 2 in the above output) in Sequential is just the ordinal
with which the module appears in the arguments. Interestingly, Sequential also
accepts an OrderedDict (from Python's collections module), in which we can name each module
passed to Sequential:
from collections import OrderedDict
seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, 8)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(8, 1))
]))
Output:
Parameter containing: tensor([-0.0173], requires_grad=True)
conda info --envs # get virtual environment path.
Output:
base /opt/anaconda3
FastAPI /opt/anaconda3/envs/FastAPI
PyTorch_1 * /opt/anaconda3/envs/PyTorch_1
- To get the trainable number of parameters in a model, can also use below
code:
num_trainable_params = sum(p.numel() for p in model.parameters() if
p.requires_grad)
- One important point of convolution operation is that the positional
connectivity exists between the input values and the output values (input values
in NxN kernel space affect output value). A convolution operation forms a many-
to-one relationship i.e. for an NxN kernel, it will map NxN values to 1 value.
For a given size of the input (i), kernel (k), padding (p), and stride (s), the size of
the output feature map (o) generated is given by:
o = floor( (i + 2p - k) / s ) + 1
To be precise, a function f(x) is said to be equivariant to a function g
if f(g(x)) = g(f(x)). Ex: If we have a function g which shifts each pixel of the image,
one pixel to the right i.e I’(x,y) = I(x-1,y). If we apply the transformation 'g' on the
image 'I' and then apply convolution, the result will be the same as if we applied
convolution to 'I' and then applied translation 'g' to the output.
i.e. conv(g(I)) = g(conv(I)).
In short, Translation Invariance means that the system produces exactly the
same response, regardless of how its input is shifted. Equivariance means that
the system works equally well across positions, but its response shifts with the
position of the target.
When processing images, this simply means that if we move the input 1
pixel to the right then it’s representations will also move 1 pixel to the right. The
property of translational equivariance is achieved in CNN’s by the concept of
weight sharing (of kernel). As the same weights are shared across the images,
hence if an object occurs in any image it will be detected irrespective of its
position in the image. This property is very useful for applications such as image
classification, object detection, etc where there may be multiple occurrences of
the object or the object might be in motion.
For example, if you are building a model to detect whether a face is present, all you need
to detect is whether the eyes are present or not; their exact position is not necessary
(translation invariance). In segmentation tasks, on the other hand, the exact position is required.
CNNs are not naturally equivariant to some other transformations such as
changes in the scale or rotation of the image. Other mechanisms are required to
handle such transformations.
In a pooling operation, we replace the output of the convnet at a certain location with
a summary statistic of the nearby outputs, such as the maximum in the case of max
pooling. Because the output is replaced by the maximum over a neighbourhood, slightly
changing the input does not affect the values of most of the pooled outputs, which is what
gives (approximate) translational invariance. Translational invariance is a useful property
where the exact location of the object is not required.
CNNs provide the three basic advantages over the traditional fully connected
layers:
(1) Firstly, they have sparse connections (processing local data
instead of entire image via kernel size) instead of fully connected connections
which lead to reduced parameters and make CNN’s efficient for processing high
dimensional data.
(2) Secondly, weight sharing takes place where the same (kernel)
weights are shared across the entire image, causing reduced memory
requirements as well as translational equivariance.
(3) Thirdly, CNN’s use a very important concept of subsampling or
pooling in which the most prominent pixels are propagated to the next layer
dropping the rest. This provides a fixed size output matrix which is typically
required for classification and invariance to translation, rotation.
- The torch.nn module provides convolutions for 1, 2, and 3 dimensions:
nn.Conv1d for time series, nn.Conv2d for images, and nn.Conv3d for volumes or
videos.
nn.Conv2d(<input/channels>, <output channels>, <size of kernel>) - More
channels in the output image, means more the capacity of the network. We need
the channels to be able to detect many different types of features. Channels per
convolution is the same as the number of neurons per layer, in a traditional NN.
Essentially, when an image is convolved by multiple filters, the output has
as many channels as there are filters that the image is convolved with.
In general, the more filters there are in a CNN, the more features of an
image that the model can learn about.
Ex:
nn.Conv2d(3, 16, kernel_size=3) # input is 3 channel (RGB), output is 16.
kernel_size=(<height>, <width>) - if single value, it means (<value>, <value>).
For nn.Conv3d(), kernel_size is (v1, v2, v3) i.e. 3 values, indicating a 3D kernel.
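A quick way to check the resulting shapes is to push a dummy tensor through the layer; a minimal sketch (the 32x32 input size is illustrative):
import torch
import torch.nn as nn
conv = nn.Conv2d(3, 16, kernel_size=3) # 3 input channels -> 16 output channels.
x = torch.rand(1, 3, 32, 32) # a dummy batch of one 32x32 RGB image.
print(conv(x).shape) # torch.Size([1, 16, 30, 30]): 32 - 3 + 1 = 30 (no padding, stride 1).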
Average the four pixels: This average pooling was a common
approach early on but has fallen out of favour somewhat.
Adaptive Avg/Max Pooling: With nn.AdaptiveAvgPool2d(), we specify the desired
output feature map size instead (it can also act as a kind of global pooling). The
layer automatically computes the pooling window so that the specified feature map
size is returned. The major advantage of this layer is that, whatever the input
size, the output from this layer is always of the fixed, specified size; hence, the neural
network can accept images of any height and width. Also see nn.AdaptiveMaxPool2d().
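A minimal sketch showing that the output size stays fixed for different input sizes:
import torch
import torch.nn as nn
gap = nn.AdaptiveAvgPool2d((1, 1)) # global average pooling to a 1x1 map per channel.
print(gap(torch.rand(1, 16, 30, 30)).shape) # torch.Size([1, 16, 1, 1])
print(gap(torch.rand(1, 16, 57, 41)).shape) # torch.Size([1, 16, 1, 1])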
Depthwise convolutions (used in the MobileNet architecture) are faster and more compact
than regular convolutions, as they require fewer computations and parameters than a
standard convolution of the same kernel size: each input channel is convolved with its own
filter instead of every filter seeing all channels (see the sketch below).
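In PyTorch a depthwise convolution can be expressed with the groups argument of nn.Conv2d; a minimal sketch (channel counts are illustrative):
import torch
import torch.nn as nn
# Depthwise: one filter per input channel (groups == in_channels).
depthwise = nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16)
# Often followed by a 1x1 "pointwise" convolution to mix channels (depthwise-separable convolution).
pointwise = nn.Conv2d(16, 32, kernel_size=1)
x = torch.rand(1, 16, 28, 28)
print(pointwise(depthwise(x)).shape) # torch.Size([1, 32, 28, 28])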
Dilated (atrous) convolutions insert gaps between the kernel elements (controlled by the
dilation argument of nn.Conv2d). This delivers a wider field of view at the same
computational cost. Dilated convolutions are particularly popular in the field of real-time
segmentation. Use them if you need a wide field of view and cannot afford multiple
convolutions or larger kernels.
The technique of atrous convolutions enables the model to capture larger
patterns (with fewer parameters) and performs well (say, in detecting the number
of people in a crowd).
Transposed Convolution:
Sometimes called a “deconvolution”, although it is not the true mathematical inverse of what a
convolutional layer does; it only reverses the spatial transformation.
A transposed convolutional layer carries out a regular convolution but reverts its
spatial transformation. To achieve this, we need to perform some fancy padding
on the input.
Transposed Convolution operation - The values of padding and stride are the
one that hypothetically was carried out on the output to generate the input. i.e. if
you take the output, and carry out a standard convolution with stride and padding
defined, it will generate the spatial dimension same as that of the input.
Step 4: Carry out standard convolution on the image generated from step 3
with a stride length (s' in figure above) of 1.
For a given size of the input (i), kernel (k), padding (p), and stride (s), the size of
the output feature map (o) generated by the transposed convolution is given by:
o = (i - 1) * s + k - 2p
class Net(nn.Module):
def __init__(self):
super().__init__() # always call this.
# registering submodules in the constructor so that the parameters() call can find all such submodules defined at the top level of this class. The objects below exist as long as the module exists.
self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
self.act1 = nn.Tanh()
self.pool1 = nn.MaxPool2d(2)
self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
self.act2 = nn.Tanh()
self.pool2 = nn.MaxPool2d(2)
self.fc1 = nn.Linear(8 * 8 * 8, 32)
self.act3 = nn.Tanh()
self.fc2 = nn.Linear(32, 2)
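The forward() method is not shown above; a minimal sketch consistent with the registered submodules (assuming 32x32 RGB inputs, so that two 2x2 poolings give the 8x8 maps implied by the 8 * 8 * 8 fc1 input):
    def forward(self, x):
        out = self.pool1(self.act1(self.conv1(x))) # [B, 3, 32, 32] -> [B, 16, 16, 16]
        out = self.pool2(self.act2(self.conv2(out))) # -> [B, 8, 8, 8]
        out = out.view(-1, 8 * 8 * 8) # flatten for the fully connected layers.
        out = self.act3(self.fc1(out))
        out = self.fc2(out)
        return out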
class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
self.fc1 = nn.Linear(8 * 8 * 8, 32)
self.fc2 = nn.Linear(32, 2)
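This second version registers only the layers that hold parameters; the stateless pooling and activations can be applied with the functional API inside forward(). A minimal sketch (assuming the same 32x32 inputs and import torch.nn.functional as F):
    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2) # functional activation/pooling: no state to register.
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        out = out.view(-1, 8 * 8 * 8)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out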
Another use case is performing in-place operations, which can be more memory efficient.
Their weights can also be accessed such as modelVar.conv1.weight.
- Saving/Loading a model:
We can save a model using:
torch.save(<model variable>.state_dict(), <path with filename.pt>)
The saved file now contains all the parameters of the model,
but no structure(i.e. architecture), only weights.
So, for loading the saved model, we need to have our model class handy,
instantiate it, then load it’s parameters with this file. Ex:
modelVar = Net()
modelVar.load_state_dict(torch.load(<path to file.pt>))
load_state_dict() copies parameters and buffers from state_dict into this module
and its descendants (in case module derived class contains other nn.Module
derived modules).
A common PyTorch convention is to save models using either a .pt or .pth file
extension.
NOTE: Using the above method, you will need the model definition(class) to
load the state_dict (after creating the model object using the model class
blueprint).
A good practice is to transfer the model to the CPU before calling torch.save, as
this will save tensors as CPU tensors and not as CUDA tensors. This will help in
loading the model on any machine, whether it contains CUDA capabilities or not.
Ex:
torch.save(<model variable>.to(‘cpu’).state_dict(), …)
NOTE: When loading files from a weights file in a virtual environment,
the path to the file should be relative to the script that is running this command;
or absolute path.
Saving/Loading entire model: We can also directly save the model object itself,
along with its parameters, using torch.save(<model>, <path>) (instead of
torch.save(<model>.state_dict(), <path>)). In that case, we need to load it in the same
way, i.e. using <model> = torch.load(<path>), instead of using
<model>.load_state_dict(torch.load(<path>)), as sketched below.
NOTE: When using the above method, the following are the drawbacks:
Since Python's pickle module is used internally, the serialized data (saved model)
is bound to the specific classes and the exact directory structure. Pickle simply
saves a path to the file containing the specific (model) class.
As you can imagine, the code might break after refactoring as the saved model
might not link to the same path. Using such a model in another project is hard as
well since the path structure needs to be maintained.
Remember that you must call model.eval() to set dropout and batch
normalization layers to evaluation mode before running inference. Failing to do
this will yield inconsistent inference results.
To load your serialized PyTorch model in C++, your application must depend on
the PyTorch C++ API – also known as LibTorch.
Ex:
#include <torch/script.h>
torch::jit::script::Module module;
// Deserialize the ScriptModule from a file using torch::jit::load().
module = torch::jit::load(<file name>);
Usually, your ML pipeline will save the model checkpoints periodically or when a
condition is met. Usually, this is done to resume training from the last or best
checkpoint. It is also a safeguard in case the training gets disrupted due to some
unforeseen issue.
However, saving the model's state_dict is not enough in the context of the
checkpoint. You will also have to save the optimizer's state_dict, along with the
last epoch number, loss, etc. Basically, you might want to save everything that
you would require to resume training using a checkpoint.
Ex:
SAVE:
torch.save({'epoch': EPOCH,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': LOSS},
'save/to/path/model.pth')
LOAD:
model = MyModelDefinition(args)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
checkpoint = torch.load('load/from/path/model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
We can move the tensors we get from the data loader to the GPU, after
which our computation will automatically take place there. But we also need to
move our parameters to the GPU. nn.Module also provides a .to() function, just
like Tensor.to(); Module.to() moves all of its parameters to the given device.
Module.to() works in place: the module instance itself is modified. Tensor.to(), by contrast, is
out of place and returns a new tensor. One implication is that it is good practice to
create the Optimizer after moving the parameters to the appropriate device.
It is considered good style to move things to the GPU if one is available. A good
pattern is to set a variable device depending on torch.cuda.is_available():
device = (torch.device('cuda') if torch.cuda.is_available()
else torch.device('cpu'))
print(f"Training on device {device}.")
Then we can amend the training loop by moving the tensors we get from the data
loader to the GPU by using the Tensor.to method.
Example Training Loop:
for imgs, labels in train_loader:
imgs = imgs.to(device=device) # move “imgs” to GPU.
labels = labels.to(device=device) # move “labels” to GPU.
outputs = model(imgs)
loss = loss_fn(outputs, labels)
There is a slight complication when loading network weights: PyTorch will
attempt to load the weight to the same device it was saved from—that is, weights
on the GPU will be restored to the GPU. As we don’t know whether we want the
same device, we have two options: we could move the network to the CPU
before saving it, or move it back after restoring.
It is a bit more concise to instruct PyTorch to override the device information
when loading weights. This is done by passing the map_location keyword
argument to torch.load:
loaded_model.load_state_dict(torch.load(data_path + 'birds_vs_airplanes.pt',
map_location=device))
- Regularization:
Training a model involves two critical steps: optimization, when we need the
loss to decrease on the training set; and generalization, when the model has to
work not only on the training set but also on data it has not seen before, like the
test set. The mathematical tools aimed at easing these two steps are often grouped
under the label regularization.
Weight penalties (L1 and L2 regularization) add penalty terms to the loss function
that encourage the model's weights to be small.
Both of them are scaled by a (small) factor, which is a hyperparameter we set
prior to training.
Note that weight decay applies to all parameters of the network, such as biases.
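In PyTorch, L2 regularization is most often applied through the optimizer's weight_decay argument; a minimal sketch (the 1e-4 value and the model variable are illustrative):
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4) # adds an L2 penalty on all parameters.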
Ex:
Implementing L1 regularization in a model:
# code is inside the training loop.
def training_loop(...):
    …
    L1_regularization = 0 # will hold the regularization term (sum of absolute values of the weights).
    for param in model.parameters():
        L1_regularization += torch.norm(param, 1) # torch.norm(param, 1) computes the L1 norm, i.e. the sum of absolute values of the parameter tensor (weights and biases alike).
    batch_loss = loss_fn(prediction, y) + 0.0001 * L1_regularization # add this penalty (scaled by 0.0001) to the loss.
    batch_loss.backward()
(2) Dropout: The idea behind dropout is: zero out a random fraction
of outputs from neurons across the network, where the randomization happens at
each training iteration.
By dropping some connections in ANN we are forcing networks to
learn from fewer resources. This forces the model to generalize.
This procedure effectively generates slightly different models with different neuron
topologies at each iteration, giving neurons in the model less chance to coordinate
in the memorization process that happens during overfitting.
We can use nn.Dropout(<probability>) module to introduce dropouts in our
model, between the nonlinear activation function (in the current layer) and the
linear or convolutional module of the subsequent layer.
Placement of Dropout layer.
Ex:
# in __init__() of the subclass.
self.conv2_dropout = nn.Dropout2d(p=0.4)
# in forward().
out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
out = self.conv2_dropout(out) # apply the dropout defined above.
During inference, the dropout layer that was used during training is still part of the model,
but all the units are considered during the prediction step. Because all the units/neurons
of a layer now contribute, the resulting activations would be larger than expected; to deal
with this, the weights are scaled by the retention probability (this is to make sure that the
distribution of the values after the affine transformation at inference time is close to that
seen during training). With this, the network is able to make accurate predictions.
To be more precise, if a unit is retained with probability p during training, the outgoing
weights of that unit are multiplied by p during the prediction stage. (PyTorch's nn.Dropout
uses the equivalent "inverted dropout" formulation: it scales the surviving activations by
1/(1-p) during training, so no scaling is needed at inference time; model.eval() simply turns
the dropout layer into a no-op.)
The main idea behind batch normalization is to rescale the inputs (even the
inputs of the hidden layers) to the activations of the network so that minibatches
have a certain desirable distribution (It’s possible that the input distribution at a
particular layer keeps changing across batches). This helps avoid the “inputs
to activation functions” being too far into the saturated portion of the
function, thereby killing gradients and slowing training.
Batch normalization shifts and scales an (intermediate) input using the mean and
standard deviation collected at that (intermediate) location over the samples of
the minibatch. Pytorch provides nn.BatchNorm1D, nn.BatchNorm2d, and
nn.BatchNorm3d modules, depending on the dimensionality of the input.
The above BatchNorm layers can be used to normalize such intermediate values and bring
them back into an acceptable range.
Batch Normalisation helps to avoid the issue of gradients becoming so small that
the weights are barely updated, especially in deep NNs.
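A minimal sketch of where a BatchNorm layer typically sits (between a convolution and its activation; the channel counts are illustrative):
import torch.nn as nn
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16), # normalizes each of the 16 channels over the minibatch.
    nn.ReLU(),
)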
Limitations:
Two limitations of batch normalization can arise:
(1) In batch normalization, we use the batch statistics: the mean and
standard deviation corresponding to the current mini-batch. However, when the
batch size is small, the sample mean and sample standard deviation are not
representative enough of the actual distribution and the network cannot learn
anything meaningful.
(2) As batch normalization depends on batch statistics for normalization,
it is less suited for sequence models (we use Layer normalization in sequences).
This is because, in sequence models, we may have sequences of potentially
different lengths and smaller batch sizes corresponding to longer sequences.
Layer Normalization (LN) normalizes the activations along the feature dimension.
Since it doesn’t depend on batch dimension, it’s able to do inference on only one
data sample.
Instance Normalization (IN) is very similar to Layer Normalization, but the difference between
them is that IN normalizes across each channel in each training example (i.e. per channel per
example), whereas LN normalizes across all features in each training example (i.e. all features
per example).
Unlike BN, IN layers use instance statistics computed from the input data in both training and
evaluation mode.
Group Normalization (GN), like LN, also normalizes along the feature dimension, but it divides
the features into groups and normalizes each group separately.
return out
So, the way to implement skip connections is to just arithmetically add earlier
intermediate outputs to downstream intermediate outputs.
nn.ReLU(inplace=True) inplace=True means that it will modify the input
directly, without allocating any separate memory for additional output. It can
sometimes slightly decrease the memory usage, but may not always be a valid
operation (because the original input is destroyed).
Similarly, torch.sigmoid_() is an inplace operation, whereas torch.sigmoid() is not.
(1) Import the relevant packages (ex: torch, torchvision, numpy, etc).
(2) Build a dataset that can fetch data one data point at a time.
(3) Wrap the DataLoader from the dataset.
(4) Build a model and then define the loss function and the optimizer.
(5) Define two functions to train and validate a batch of data, respectively.
(6) Define a function that will calculate the accuracy of the (validation) data.
(7) Perform weight updates based on each batch of data over increasing epochs,
till desired accuracy (on validation dataset) & acceptable loss values (on training
dataset) are obtained. Also, plot the accuracies & losses in a graph, over the
epochs iteration.
- Example of operations on images:
# crop image by selecting between start & end rows, start & end columns.
img = img[<rowStart> : <rowEnd> , <columnStart> : <columnEnd>]
# convert cv2 image to pytorch tensor. Can use below technique for “PIL Image” too.
t1 = transforms.ToTensor()
imgTensor = t1(img) # Format (C x H x W).
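To display the tensor with matplotlib, the channel dimension has to be moved back to the end; a minimal sketch:
from matplotlib import pyplot as plt
plt.imshow(imgTensor.permute(1, 2, 0)) # [C x H x W] -> [H x W x C], as matplotlib expects.
plt.show()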
plt.title('title') # plot title.
plt.plot(valuesTensorX, valuesTensorY, label='some label') # plot data.
- Data Augmentations:
Each tuple will contain the i-th element of all of the iterables provided. Ex: zip(['a', 'b',
'c'], [1, 2, 3]) -> ('a', 1) ('b', 2) ('c', 3).
This helps to reorganize the batch data (for example inside a custom collate_fn) so that all
inputs in a batch have the same size, as required by DataLoaders; see the sketch below.
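A minimal sketch of a custom collate_fn built on zip (the function name and the right-padding strategy are assumptions, just to illustrate the idea):
import torch
from torch.nn.functional import pad
def pad_collate(batch): # batch is a list of (tensor, label) pairs produced by the Dataset.
    inputs, labels = zip(*batch) # regroup into a tuple of tensors and a tuple of labels.
    max_len = max(x.shape[-1] for x in inputs)
    inputs = [pad(x, (0, max_len - x.shape[-1])) for x in inputs] # right-pad to a common length.
    return torch.stack(inputs), torch.tensor(labels)
# loader = DataLoader(dataset, batch_size=32, collate_fn=pad_collate)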
algorithms aim towards learning to compare two images to come up with a score
for how similar the images are.
Here, we input a few (input) images of each class to the network while
training and ask it to predict the class for a new (query) image based on the
images. So far, we have been using pre-trained models to solve such problems.
However, such models are likely to overfit soon, given the tiny amount of data
that is available.
We can leverage multiple metrics, models, and optimization-based
architectures to solve such scenarios. We use metric-based architectures that
come up with an optimal metric, either a Euclidean distance or cosine similarity,
to group similar images together and then predict on a new image.
An N-shot k-class classification is where there are N images each for
the k classes to train the network.
Uses of such similarity measures can be found in applications for hand
written checks, face recognition, etc.
- Siamese Networks:
samples closer and dissimilar samples far apart) to train the network - more
formally, we suppose that we have a pair (Ii, Ij) and a label Y that is equal to 0 if
the samples are similar and 1 otherwise. To extract a low-dimensional
representation of each sample, we use a CNN f that encodes the input images Ii
and Ij into an embedding space where xi = f(Ii) and xj = f(Ij). The contrastive loss
is defined as:
L = (1 - Y) * ||x_i - x_j||^2 + Y * [ max(0, m - ||x_i - x_j||) ]^2
where m is a hyperparameter (the margin), defining the lower bound on the distance between
dissimilar samples.
Contrastive loss is used to penalize the network (during training) for
marking 2 images as dissimilar (high metric value) when they are similar; as well
as for marking 2 images as similar (low metric value) when they are dissimilar.
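A minimal sketch of this loss in PyTorch (the margin value is illustrative; Y is a float tensor of 0/1 labels following the convention above, 0 = similar, 1 = dissimilar):
import torch
def contrastive_loss(x_i, x_j, Y, margin=1.0):
    d = torch.nn.functional.pairwise_distance(x_i, x_j) # Euclidean distance between the embeddings.
    loss = (1 - Y) * d.pow(2) + Y * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()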
Twin networks are also used for Object Tracking, because of its unique two
tandem inputs and similarity measurement.
- Prototypical Networks:
In the preceding example illustration, there are three classes and each circle represents
the embedding of an image belonging to that class. Each star (prototype) is the average
embedding across all the images (circles) of its class.
- Relation Networks:
A relation network is fairly similar to a Siamese network, except that the metric
we optimize for is not the L1 distance between embeddings but a relation score.
In the preceding diagram, the pictures on the left are the support set for
five classes and the dog image at the bottom is the query image:
(a) Pass both the support and query images through an embedding module,
which provides embeddings for the input image.
(b) Concatenate the feature maps of the support images with the feature maps
of the query image.
(c) Pass the concatenated features through a CNN module to predict the
relation score.
The class with the highest relation score is the predicted class of the query
image.
(A) CLASSIFICATION:
1) Example Classifier code for training Cats & Dogs images on disk:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from matplotlib import pyplot as plt
import os
import cv2
# this class loads the data from the specified train/test folder for Cat-Dog-Classification, & prepares the input data (transforms).
class CatDogDataSet(Dataset):
output_classes = [] # class variable - contains the output classes.
training_set_str = "/training_set"
test_set_str = "/test_set"
# usually, this should contain just the list of the file names in the training (or validation) dataset.
def __init__(self, path, isTraining) -> None:
super().__init__()
self.data = []
self.labels = [] # this will be converted to a one-hot tensor just before training/validation, using torch.nn.functional.one_hot().
# get output classes from training(or test) set. exclude hidden folders.
CatDogDataSet.output_classes = [folderStr for folderStr in
os.listdir(path+CatDogDataSet.training_set_str) if not(folderStr.startswith('.'))]
# convert to tensor (values range = (0,1)), resize the shortest edge of the input image to size=128, then center-crop to make the image size (128, 128).
self.preprocess = transforms.Compose([transforms.ToTensor(),
transforms.Resize(128), transforms.CenterCrop(128)])
# Here, entire data is being loaded in init() itself.
self.load_data(path, isTraining)
#print(len(self.data), len(self.labels))
# returns the length of the total data in training (or validation) set.
def __len__(self):
return len(self.data)
# Returns the data at the specified index. Usually, data should be loaded into memory here (from the filename string at index), & then returned.
def __getitem__(self, index):
    return self.data[index], self.labels[index] # labels[index] returns a tensor of shape (1, len(output_classes)).
path = path + sub_path
import torch
from torch.utils.data import DataLoader
from torch.nn import Module
import torch.optim as optim
from CatDogDataset import CatDogDataSet
optimizerFn.zero_grad()
loss.backward() # backward pass.
optimizerFn.step() # update weights.
# convert labels from one-hot to class indices. For validation, using class indices is better, for ease of computation i.e. (predictions == labels).sum().
#labels = torch.argmax(labels, dim=1)
# predictions & labels are 2D tensors when using one-hot-encoding, so need to
divide by number of output classes & use the int() value.
#correct_predictions = (predictions == labels).sum() /
CatDogDataSet.get_number_of_output_classes() ###
correct_predictions = (predictions == labels).sum()
accuracy_ratio = correct_predictions.int().item() / total
return accuracy_ratio, correct_predictions.item()
import torch
import torch.nn as nn
import torchvision
class DogCatClassifierModel(nn.Module):
#model_name = "cat_dog_classifier_model.pth" # model is saved with this name.
#model_name = "cat_dog_classifier_model_arch2.pth"
model_name = "cat_dog_classifier_model_arch3.pth"
# model architecture.
# input image size (after applying transforms) = 128 x 128 pixels.
#self.model = self.model_arch1(output_classes)
#self.model = self.model_arch2(output_classes)
self.model = self.model_arch3(output_classes)
self.model.eval()
out = self.model(inputs)
return out
nn.Conv2d(32, 8, kernel_size=3), # [32, 30, 30] -> [8, 28, 28]. input channels = 32, output channels = 8; kernel_size=3 reduces H/W by 2 pixels (30-2=28).
nn.MaxPool2d(2), # [8, 28, 28] -> [8, 14, 14] (28/2=14)
nn.ReLU(),
# difference from model_arch2() is that the number of channels keeps increasing till the end.
return nn.Sequential(
    self.conv_layer(3, 16, 3),    # [3, 128, 128] -> [16, 63, 63]  ( (128-2)/2 = 63 ), kernel size=3.
    self.conv_layer(16, 64, 3),   # [16, 63, 63] -> [64, 30, 30]   ( (63-2)/2 = 30 )
    self.conv_layer(64, 128, 3),  # [64, 30, 30] -> [128, 14, 14]  ( (30-2)/2 = 14 )
    self.conv_layer(128, 256, 3), # [128, 14, 14] -> [256, 6, 6]   ( (14-2)/2 = 6 )
    self.conv_layer(256, 512, 3), # [256, 6, 6] -> [512, 2, 2]     ( (6-2)/2 = 2 )
    nn.Flatten(),                 # make 1D (512 * 2 * 2 = 2048)
    nn.Linear(2048, 50),
    nn.ReLU(),
    nn.Linear(50, output_classes)
)
# create blocks that can be reused in a loop when creating the model layers of the architecture.
# n_i = input channels, n_o = output channels.
def conv_layer(self, n_i, n_o, kernel_size):
return nn.Sequential(
nn.Conv2d(n_i, n_o, kernel_size), # width & height (-2)
nn.BatchNorm2d(n_o),
nn.ReLU(),
nn.MaxPool2d(2) # width & height (/2)
)
- Main.py code:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn import Module
import torch.optim as optim
from CatDogDataset import CatDogDataSet
from DogCatClassifierModel import DogCatClassifierModel
from train_validate import train, validate
import os
# ### in code -> using 1D tensor instead of one-hot encoding. Using class indices
instead of one-hot-encoding (when using CrossEntropyLoss) doesn't work (on
predictions), as when we use torch.argmax() or related functions, they are not
differentiable (no gradients), hence loss cannot be computed.
# runs validation on validation set, using a model trained using run_training().
def run_validation():
CDDataset = CatDogDataSet(path, False) # prepare validation data.
dLoader = DataLoader(CDDataset, batch_size=200, shuffle=False)
model = DogCatClassifierModel(CatDogDataSet.get_number_of_output_classes())
model.training = False
result = model.load_state_dict(torch.load(DogCatClassifierModel.model_name)) # load model weights from disk, into the model variable.
print("----------Validation Started--------------")
validate(dLoader, model)
print("----------Validation Complete--------------")
print("\n")
Transfer Learning:
1) Normalize the input images with the same mean and standard deviation that was used
during the training of the pre-trained model.
2) Fetch the pre-trained model's architecture & its trained weights.
3) Discard the last few layers of the pre-trained model.
4) Connect the truncated pre-trained model to a freshly initialized layer
(or layers) where weights are randomly initialized. Ensure that the output of the
last layer has as many neurons as the classes/outputs we would want to predict.
5) Ensure that the weights of the pre-trained model are not trainable (in other words
frozen, i.e. not updated during backpropagation, by setting requires_grad = False), while the
weights of the newly initialized layer(s) and the weights connecting them to the output layer
remain trainable. We do not train the weights of the pre-trained model, as we assume those
weights are already well learned for the task, and hence leverage the learning from a large
model; in summary, we only train the newly initialized layers on our small dataset (see the
sketch after this list).
6) Train the trainable parameters of the model via usual training
techniques.
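A minimal sketch of steps 4-5 for a torchvision VGG16 backbone (the head sizes and num_classes are illustrative):
import torch.nn as nn
from torchvision import models
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False # freeze the pre-trained weights.
num_classes = 2 # e.g. cats vs dogs.
model.classifier = nn.Sequential( # new, randomly initialized (and trainable) head.
    nn.Linear(25088, 256), nn.ReLU(), nn.Dropout(0.2), nn.Linear(256, num_classes))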
Tips:
A common tip is to assign different learning rates to different parts of the network - for
example, to layers at more depth (i.e. where higher level features are learnt) - to speed up
convergence or fine-tune their learning.
This allows for finer control over how learning occurs.
Example Code:
optimizer = optim.AdamW([{'params': model.module1.parameters(), 'lr':3e-4},
{'params': model.module2.parameters(), 'lr': 5e-7}])
Above code specifies different learning rates (3e-4 & 5e-7) to different
modules in the model i.e. module1 & module2, as a list of dictionaries.
# The models module in the torchvision package hosts the various pre-trained models available
in PyTorch.
from torchvision import models
device = 'cuda' if torch.cuda.is_available() else 'cpu'
from torchsummary import summary
# Load the VGG16 (pretrained) model.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).to(device)
summary(model, torch.zeros(1, 3, 224, 224)) # get the model summary/architecture.
print(model) # prints another form of summary (shows grouping under the features, avgpool and classifier sections) of the model.
# The simple printing of “model” above reveals that VGG has 3 main modules: features, avgpool & classifier. We usually freeze features & avgpool, training a new classifier module only. Delete the classifier module (or only a few layers at the bottom) and create a new one in its place that will predict the required number of classes corresponding to our “cats-dogs” dataset (instead of the existing 1,000).
# (1) transform training images similar to ones used in training VGG initially.
img = cv2.resize(img, (224, 224)) # can use transforms.Resize() & .CenterCrop() too, while loading input images.
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# NOTE: While leveraging pre-trained models, it is mandatory to resize, permute, and then normalize images.
# Define the loss function according to our cat-dogs requirement i.e. binary loss (can continue
using CrossEntropyLoss too).
loss_fn = nn.BCELoss()
(B) Transfer learning by modifying Resnet18 Architecture:
While so far, we have been interested in extracting the F(x) value, where x is the
value coming from the previous layer, in the case of a residual network, we are
extracting not only the value after passing through the weight layers, which is
F(x), but are also summing up F(x) with the original value, which is x.
So far, we have been using standard layers that perform either linear
or convolution transformations F(x) along with some non-linear activation. Both of
these operations in some sense destroy the input information. For the first time,
we are seeing a layer that not only transforms the input, but also preserves it, by
adding the input directly to the transformation – F(x) + x.
The skip connections are made after every 2 layers.
# add the previous layer input (inputs) to the current processing (self.conv) layer.
def forward(self, inputs):
outputs = self.conv(inputs) + inputs
return outputs
# Define a class for the Resnet MODEL (DGTransferResNet18Model) & load pretrained weights.
def __init__(self, output_classes) -> None:
    self.transfer_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    # Inspecting the model - it contains the sections: convolution, batch normalization, ReLU, MaxPooling, 4 layers of ResNet blocks, avgpool, fully connected layer (FC).
    #print(self.transfer_model)
Facial key points denote the markings of various keypoints on the image that
contains a face.
Problems to be solved for the keypoints task:
(a) All the input images (which come in different sizes) need to be resized to the
same size, hence their labeled key points also need to be adjusted accordingly. The
solution is to represent each keypoint as a value between 0 & 1, where 0 & 1 correspond
to the start & end of the image along that axis, i.e. keypoint locations are expressed
relative to the image dimensions (see the sketch below). Since the values are between 0-1,
we can use a sigmoid activation at the final layer to get the outputs.
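A minimal sketch of that normalization, and of mapping predictions back to pixels (kp is assumed to be an N x 2 array of (x, y) keypoints for one image):
import numpy as np
def normalize_keypoints(kp, width, height):
    kp = kp.astype(np.float32).copy()
    kp[:, 0] /= width   # x coordinates -> [0, 1]
    kp[:, 1] /= height  # y coordinates -> [0, 1]
    return kp
def denormalize_keypoints(kp, width, height):
    kp = kp.copy()
    kp[:, 0] *= width   # back to pixel coordinates of the displayed image.
    kp[:, 1] *= height
    return kp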
Dataset:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import pandas as pd
import os
import cv2
# "FaceDataset" class loads the image file names & their raw keypoint data at init.
During getitem(), it returns the loaded image, & it's normalized keypoint data in a 1D
tensor.
# Steps:
# init(): get list of (<image file name>, <keypoints for that image>). Keypoints are
in form <x1, y1, x2, y2,....,xN, yN>.
# getitem(): load the image in memory. Normalize the keypoints location according
to original image's dimensions(i.e. between 0 to 1). Return (<loaded image>,
<keypoints>). Returned keypoints are in the form <x1, x2,....,xN, y1, y2,..., yN>.
class FaceDataset(Dataset):
# class variable. Initialized during import of this class itself. Sets up the pre-processing, for future use in __getitem__().
preprocess = transforms.Compose([transforms.ToTensor(), transforms.Resize(224),
transforms.CenterCrop(224), transforms.ConvertImageDtype(torch.float32),
transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])])
def __len__(self):
return len(self.data)
image = FaceDataset.preprocess(image)
# also preprocess ground truth i.e. label data.
label_data = self.preprocess_label(label_data, image_shape)
return (image.to(device), label_data.to(device)) # shift to the appropriate device.
# extracts the label for the provided file name. Assumes that the 1st field in the csv has this file name as a parameter.
def extract_label_for_filename(self, file_name):
# extract the required label data from "self.label_data". Here the key (image name) is in column-number=0.
label_data = self.label_data.loc[self.label_data.iloc[:, 0] == file_name]
if label_data.empty:
return None # if "test" folder contains images that do not have
corresponding entries in "test_frames_keypoints.csv".
# remove the first column that contains the image name.
label_data = label_data.iloc[:, 1:]
# convert to (1D) pandas dataframe to tensor.
label_data = torch.tensor(label_data.iloc[0].values).float()
return label_data
# static method - loads the input image & also returns the preprocessed image, for input to model prediction.
def get_input_processed_image(image_path):
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) #/255
height, width, _ = image.shape
preprocessed_image = FaceDataset.preprocess(image)
# "image" is not a tensor.
return image, preprocessed_image.to(device), height, width
Model:
import torch
import torch.nn as nn
import torchvision
from torchvision import models
import cv2
class FaceKPModel(nn.Module):
def __init__(self) -> None:
super().__init__()
self.model_name = "trained_models/FaceKP_VGG16.pth"
self.model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in self.model.parameters():
param.requires_grad = False # freeze weights.
#self.model_arch1()
#self.model_arch2()
self.model_arch3()
#self.model_arch4()
# this classifier has less Dropout(), resulting in a slightly lower loss.
self.model.classifier = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
nn.Dropout(0.1), nn.Linear(512, 136), nn.Sigmoid())
Train/Validate:
import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from FaceDataset import FaceDataset
import cv2
#from torch.utils.tensorboard import SummaryWriter
model.train()
total_loss += loss.item()
with torch.no_grad():
for _, [inputs, labels] in enumerate(dLoader):
predictions = model(inputs)
accuracy_ratio = compute_accuracy(predictions, labels, lossFn)
total += 1 # total is just the number of times this loop runs. As accuracy is a ratio, total accuracy = (total_accuracy / total).
total_accuracy += accuracy_ratio
print("\n")
print(f"Total Validation loss: {total_accuracy / total}")
print("\n")
# this will compute the accuracy given predictions & labels. Accuracy here is nothing but the computed loss.
def compute_accuracy(predictions, labels, lossFn):
loss = lossFn(predictions, labels)
return loss.item()
import pandas as pd
# can compare "temp_prediction" with "o1" (actual keypoints from csv) whole
debugging in data viewer.
"""
#temp
Main.py:
import torch
import torch.nn as nn
from FaceDataset import FaceDataset
from torch.utils.data import DataLoader
from FaceKeypointModel import FaceKPModel
import torch.optim as optim
from train_validate import train, validate, plot_keypoints_on_image
def run_training():
training_path = "../dataset/training"
label_file_path = "../dataset/training_frames_keypoints.csv"
faceKPdataset = FaceDataset(training_path, label_file_path)
dLoader = DataLoader(faceKPdataset, batch_size=100, shuffle=True)
model = FaceKPModel().to(device)
#lossFn = nn.L1Loss() # L1Loss = Mean Absolute Error(MAE).
lossFn = nn.MSELoss()
optimizerFn = optim.Adam(model.parameters(), lr=0.0005)
epochs = 20
# start training.
print("------------Training Started-------------")
train(epochs, dLoader, model, lossFn, optimizerFn)
print("------------Training Complete-------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
def run_validation():
test_path = "../dataset/test"
label_file_path = "../dataset/test_frames_keypoints.csv"
faceKPdataset = FaceDataset(test_path, label_file_path)
dLoader = DataLoader(faceKPdataset, batch_size=100, shuffle=False)
model = FaceKPModel().to(device)
result = model.load_state_dict(torch.load(model.model_name))
lossFn = nn.L1Loss()
print("------------Validation Started-------------")
validate(dLoader, model, lossFn)
print("------------Validation Complete-------------")
image_path = "../dataset/training/Adrian_Nastase_42.jpg"
image, preprocessed_image, height, width =
FaceDataset.get_input_processed_image(image_path)
model = FaceKPModel().to(device)
result = model.load_state_dict(torch.load(model.model_name))
plot_keypoints_on_image(model, preprocessed_image, image, height, width)
#run_training()
#resume_training("trained_models/FaceKP_VGG16.pth")
#run_validation()
test_faceKP_on_image()
- torch_snippets library:
Much of the code in model training is common & repetitive and has to be written every time. The torch_snippets library provides one-line functions for such common tasks, which shortens development time & provides convenience.
Many operations, for example reading an image, showing an image, the entire training loop, etc., are repetitive & can be reused as single function calls.
Moreover, subtle things are taken care of by the torch_snippets library; for example, when reading images using cv2, images are internally converted into [C x H x W] as required by PyTorch, & so on.
torch_snippets can be installed using pip: pip install torch-snippets.
Additional dependencies:
pip install fitz
pip install PyMuPDF
Database link:
https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/
blob/master/Chapter05/age_gender_prediction.ipynb
(Open link in google colab & download database from code -
getFile_from_drive() section).
- Calculate the overall loss by summing the losses for age & gender, &
perform backpropagation on this overall loss.
Code Snippets:
Dataset:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import cv2, pandas as pd
import numpy as np
class AgeGenderDataset(Dataset):
AGE_RANGE = 80 # the actual scale/range of age (0 to 80).
preprocess = transforms.Compose([transforms.ToTensor(), transforms.Resize(224),
transforms.CenterCrop(224), transforms.ConvertImageDtype(torch.float32),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
def __len__(self):
return len(self.input_df)
# static method to prepare single sample input "input_name" for prediction.
def test_input(input_name):
image = cv2.imread(input_name)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)/255
image = AgeGenderDataset.preprocess(image)
image = torch.unsqueeze(image, 0) # add a batch dimension to the single input image.
return image.to(device)
Model:
import torch
from torchvision import models
import torch.nn as nn
from AgeGenderClassifierSubModule import AgeGenderClassifierSubModule
class AgeGenderModel(nn.Module):
model_name = "trained_models/AgeGenderModel_VGG16.pth"
Model - AgeGenderClassifierSubModule:
import torch
import torch.nn as nn
class AgeGenderClassifierSubModule(nn.Module):
def __init__(self) -> None:
super().__init__()
self.intermediate = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.2), nn.Linear(128, 64), nn.ReLU())
# age classifier (1 output neuron with sigmoid). Age prediction is between (0-1), to be scaled by 80.
self.age_classifier = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
# gender classifier (1 output neuron with sigmoid). Gender prediction is modified to either 0 or 1 as output, from the sigmoid's output.
self.gender_classifier = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
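The forward() of this sub-module is not shown; a minimal sketch consistent with the layers above (it returns one age value and one gender probability per sample):
    def forward(self, x):
        x = self.intermediate(x) # shared trunk: 2048 -> 64 features.
        age = self.age_classifier(x) # age in (0, 1), to be scaled by AGE_RANGE.
        gender = self.gender_classifier(x) # probability of one of the two genders.
        return age, gender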
Train/Validate:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
def train(epochs, model:nn.Module, dLoader:DataLoader, age_lossFn, gender_lossFn,
optimFn:optim):
model.train()
torch.autograd.set_detect_anomaly(True)
for epoch in range(epochs):
for _, [input_image, age_labels, gender_labels] in enumerate(dLoader):
optimFn.zero_grad()
age_predictions, gender_predictions = model(input_image)
# BCELoss() expects both predictions & labels to be of type float; it computes the binary cross-entropy between them.
age_predictions = age_predictions.squeeze().to(device) # remove dimensions of size 1: "age_predictions" has shape [<batch size>, 1], while "age_labels" has shape [<batch size>].
age_loss = age_lossFn(age_predictions, age_labels)
# If the gender head had 2 outputs (one-hot style), we would convert "gender_predictions" to a 1D tensor of class indices using argmax(). Since our model uses a single output neuron for gender, the output is already 1D and no conversion is needed.
#gender_predictions = torch.argmax(gender_predictions, dim=1)
gender_predictions = gender_predictions.squeeze().to(device) # original shape = [<batch size>, 1].
gender_loss = gender_lossFn(gender_predictions, gender_labels)
# sum both losses to get total loss, that we will use to backpropagate.
total_loss = age_loss + gender_loss
total_loss.backward()
optimFn.step()
print(f"Epoch:{epoch} - Training: Age Loss = {age_loss.item()} , Gender Loss =
{gender_loss.item()} , Total Loss = {total_loss.item()}")
# BCELoss() expects both predictions & labels to be of type float.
age_predictions = age_predictions.squeeze().to(device)
age_loss = age_lossFn(age_predictions, age_labels)
# gender accuracy can be computed, as gender is categorical in nature.
gender_predictions = gender_predictions.squeeze().to(device)
gender_accuracy = compute_gender_accuracy(gender_predictions, gender_labels)
print(f"Validation per Batch: Age Loss = {age_loss.item()} , Gender
Accuracy Ratio = {gender_accuracy}")
total_age_loss += age_loss.item()
total_gender_accuracy += gender_accuracy
total_batches += 1
print(f"Average: Age Loss = {total_age_loss/total_batches} , Gender Accuracy
Ratio = {total_gender_accuracy/total_batches}")
Main:
import torch
import torch.nn as nn
from AgeGenderDataset import AgeGenderDataset
from torch.utils.data import DataLoader
from AgeGenderModel import AgeGenderModel
from train_validate import train, validate
import torch.optim as optim
device = "cuda" if torch.cuda.is_available() else "cpu"
ROWS = 10000
def run_training():
dataset_path = "../dataset/"
train_labels = "fairface-label-train.csv"
epochs = 20
AG_dataset = AgeGenderDataset(dataset_path, train_labels, ROWS)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(AG_dataset, batch_size=100, shuffle=True, drop_last=True)
model = AgeGenderModel().to(device)
age_lossFn = nn.MSELoss()
gender_lossFn = nn.BCELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Started-----------------")
train(epochs, model, dLoader, age_lossFn, gender_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
def run_validation():
dataset_path = "../dataset/"
train_labels = "fairface-label-val.csv"
AG_dataset = AgeGenderDataset(dataset_path, train_labels, int(ROWS/10))
dLoader = DataLoader(AG_dataset, batch_size=100, shuffle=False, drop_last=True)
model = AgeGenderModel().to(device)
result = model.load_state_dict(torch.load(model.model_name))
age_lossFn = nn.MSELoss()
print("------------Validation Started-----------------")
validate(model, dLoader, age_lossFn)
print("------------Validation Complete-----------------")
# to resume training.
def resume_training(epochs = 20):
dataset_path = "../dataset/"
train_labels = "fairface-label-train.csv"
AG_dataset = AgeGenderDataset(dataset_path, train_labels, ROWS)
dLoader = DataLoader(AG_dataset, batch_size=100, shuffle=True, drop_last=True)
model = AgeGenderModel().to(device)
# load previously trained model
result = model.load_state_dict(torch.load(model.model_name))
age_lossFn = nn.MSELoss()
gender_lossFn = nn.BCELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Resumed-----------------")
train(epochs, model, dLoader, age_lossFn, gender_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs,
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
age_predictions *= AgeGenderDataset.AGE_RANGE # scale to actual age range.
age_predictions = age_predictions.to(torch.int32)
gender_predictions = gender_predictions.squeeze().to(device)
# modify gender predictions to actual values i.e. Male/Female.
gender_value_prediction = None
if(gender_predictions.item() > 0.5):
gender_value_prediction = "Female"
else:
gender_value_prediction = "Male"
print("\n")
print(f"Predictions: Age = {age_predictions.item()} , Gender =
{gender_value_prediction}")
print("\n")
#run_training()
#resume_training()
run_validation()
#test_on_input("../dataset/fairface-img-margin025-trainval/val/6000.jpg") # Actual: age = 29, gender = Male. Predictions: age = 43, gender = Male.
#test_on_input("../dataset/fairface-img-margin025-trainval/val/6002.jpg") # Actual: age = 24, gender = Female. Predictions: age = 28, gender = Female.
#test_on_input("../dataset/fairface-img-margin025-trainval/val/6008.jpg") # Actual: age = 57, gender = Female. Predictions: age = 36, gender = Female.
test_on_input("../dataset/fairface-img-margin025-trainval/val/6010.jpg") # Actual: age = 10, gender = Male. Predictions: age = 33, gender = Male.
Class activation maps are a simple technique to get the discriminative image
regions used by a CNN to identify a specific class in the image. In other words, a
class activation map (CAM) lets us see which regions in the image were relevant
to this class.
Class Activation Maps at different stages/layers.
We can also use class weights to handle imbalance in the data (in addition to the choice of loss function), i.e. assign higher weights to rarely occurring classes, thereby explicitly telling the model that we want to correctly classify the rare classes.
These weights can be provided to the loss function via its "weight" argument:
Ex (CrossEntropy loss):
#class weights for 6 class multi-class classification
class_weights = [0.5281, 0.8411, 0.9619, 0.8634, 0.8477, 0.9577]
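A hedged sketch of how these weights can be passed to the loss (continuing from the class_weights list above; the weight-computation heuristic in the comment is a common choice, not taken from these notes):
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# CrossEntropyLoss accepts per-class weights via its "weight" argument (a 1D tensor, one entry per class).
class_weights = torch.tensor([0.5281, 0.8411, 0.9619, 0.8634, 0.8477, 0.9577])
lossFn = nn.CrossEntropyLoss(weight=class_weights.to(device))
# A common heuristic (an assumption, not from these notes): weight_c = N / (num_classes * N_c),
# where N is the total number of samples and N_c is the number of samples of class c.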
Formula for calculating weights:
Ex (MSE loss):
def mse_loss(input, target):
    return ((input - target) ** 2).mean() # vanilla MSE loss.
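The notes only show the vanilla version; a hedged sketch of a sample-weighted variant (the weighting scheme here is an assumption, not from the notes) could look like:
def weighted_mse_loss(input, target, weights):
    # 'weights' is a per-sample (or per-class) weight tensor broadcastable to the error's shape.
    return (weights * (input - target) ** 2).mean()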
ViT uses transformers instead of CNNs for performing vision tasks such as
classification & detection.
Tokenization: The input image is divided into fixed-size non-overlapping
patches or tokens.
Each patch is treated as a flattened (1D) sequence of features (e.g. like an
embedding vector of a single token in NLP), and these patches are linearly
embedded into a lower-dimensional space to serve as the model's input.
Transformer requires the input token’s embedding vector to be of a fixed
size ‘D’, so a patch is mapped to a vector of dimension ‘D’ via a learnable linear
projection.
Each patch is analogous to a token (embedding vector), and all patches together form the input sequence to the ViT. If (P, P) is the patch size (acting as the downsampling ratio - a lower P means higher feature resolution & vice versa), then for an input image with height 'H', width 'W' and 'C' channels, N = (H/P) * (W/P) is the number of patches and each flattened patch has size P² * C. The size of the entire input sequence is therefore N * P² * C.
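A minimal sketch of this patch-embedding arithmetic (shapes only; the Conv2d-based projection is one common way to implement the learnable linear projection, not necessarily the one used in any particular ViT implementation):
import torch
import torch.nn as nn

H, W, C, P, D = 224, 224, 3, 16, 768        # image size, channels, patch size, embedding dim.
N = (H // P) * (W // P)                     # number of patches = 14 * 14 = 196.
x = torch.randn(1, C, H, W)                 # dummy input image (batch of 1).
# A PxP convolution with stride P extracts non-overlapping patches and projects each
# flattened patch (P*P*C values) to a D-dimensional embedding in one step.
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)
tokens = patch_embed(x)                     # shape: [1, D, H/P, W/P].
tokens = tokens.flatten(2).transpose(1, 2)  # shape: [1, N, D] - the input sequence to the ViT.
print(tokens.shape)                         # torch.Size([1, 196, 768])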
Self Attention: ViT uses self-attention mechanisms to compute
relationships between different patches in the image. This allows the model to
attend to relevant patches and learn contextual information for each patch.
ViT does not introduce image-specific inductive biases into the architecture apart
from the initial patch extraction step.
Hybrids (ViT with CNN backbone) slightly outperform ViT at small computational
budgets, but the difference vanishes for larger models. This result is somewhat
surprising, since one might expect convolutional local feature processing to
assist ViT at any size.
Self-attention allows ViT to integrate information across the entire image
even in the lowest layers. The “attention distance” is analogous to receptive
field size in CNNs.
Some heads attend to most of the image already in the lowest layers (i.e.
blocks - network depth), showing that the ability to integrate information globally
is indeed used by the model. Other attention heads have consistently small
attention distances in the low layers. This highly localized attention is less
pronounced in hybrid models that apply a ResNet before the Transformer,
suggesting that it may serve a similar function as early convolutional layers in
CNNs.
The fact that the position embeddings learn to represent the 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield additional improvements.
Multiclass: An image belonging to one class out of several possible classes.
Multiclass Detection.
Multilabel Detection.
YBat Tool:
For preparing dataset that contains ground truth (bounding box coordinates &
classes of objects) for given input images, we can use data annotation tools,
such as Ybat (Yolo BBox Annotation Tool). It is available at:
https://github.com/drainingsun/ybat
(1) Create a txt file that contains all the class names (1 class per line), say "classes.txt". Upload it in the (opened) ybat.html page, under the "Classes: Choose File" button.
(2) Prepare ground truth (drawing BoundingBox or BB around target objects
in images/dataset). Upload the images in the “Images: Choose Files” option, by
selecting all the images that are to be included for annotation.
(3) Before performing annotation(creating BB), be sure to select the
correct class from the class list. Perform annotation for all desired objects in all
images.
NOTE: BBs, once created, can be resized, moved around or deleted
using delete key (Right Cmd + Delete on mac). Restore will bring back any
deleted BBs.
(4) Save the annotated data using “Save Yolo” button. This downloads
a zipped .txt file/s of the BBs of all objects, per image/file.
For Yolo format, the classes are numbered 0 onwards, & all the BB
coordinates(x-center, y-center, width, height) are normalized from 0(origin - top
left) to 1(image dimensions - bottom-right).
X-center & y-center are the center-point of the BB. To normalize coordinates, we
divide x values by width of image, & y values by height of image.
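A small sketch of that conversion, from corner-format pixel coordinates to the normalized YOLO (x-center, y-center, width, height) format (the function name is illustrative):
def to_yolo_format(xmin, ymin, xmax, ymax, img_w, img_h):
    # convert corner coordinates (pixels) to normalized (x_center, y_center, width, height).
    x_center = ((xmin + xmax) / 2) / img_w
    y_center = ((ymin + ymax) / 2) / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height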
YBat also allows saving in COCO (outputs in JSON - x-left, y-top, width,
height) or VOC (outputs in xml - xmin-left, ymin-top,xmax-right, ymax-bottom)
format. Above formats also provide the original image dimensions.
Region Proposals:
A Region Proposal Network, or RPN (backbone of Faster R-CNN), is a fully
convolutional network that simultaneously predicts probable object bounds
(rectangle) and objectness scores at each position. The RPN is trained end-to-
end to generate high-quality region proposals.
A region proposal that has a high intersection (computed using IoU - Intersection
over Union) with the location (ground truth) of an object in the image of interest is
labeled as the one that contains the object, and a region proposal with a low
intersection is labeled as background.
Intersection within the term Intersection over Union measures how overlapping
the predicted and actual bounding boxes are, while Union measures the overall
space possible for overlap. IoU is the ratio of the overlapping region between the
two (predicted & ground truth) bounding boxes over the combined region of both
the bounding boxes.
Non Max Suppression:
When multiple region proposals are generated and (their BBs) significantly
overlap one another, Non-Max Suppression can be used to select the best BB
containing the object of interest, out of all the BBs.
“Non-max” refers to the boxes that do not contain the highest probability of
containing an object, and “suppression” refers to us discarding those boxes that
do not contain the highest probabilities of containing an object. In non-max
suppression, we identify the bounding box that has the highest probability and
discard all the other bounding boxes that have an IoU greater than a certain
threshold with the box containing the highest probability of containing an object.
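For reference, torchvision ships this operation as torchvision.ops.nms (it is also used in the code later in these notes); a minimal usage sketch with made-up boxes:
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],    # heavily overlaps the first box.
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.6, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)  # indices of the boxes kept, sorted by score.
print(keep)   # tensor([0, 2]) - the lower-scoring overlapping box is suppressed.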
mAP:
mAP is measured by taking the mean of all average precisions (the area under a
Precision vs Recall curve) across all IoU thresholds and for all classes. This
metric provides an overall model performance, irrespective of any manually-set
threshold.
i.e. P = TP / (TP + FP).
P = TP / Total (positive) Predictions (of that class by the model).
Precision measures how accurate your predictions are, i.e. the percentage of correct predictions out of all positive predictions made.
A True Positive refers to the bounding boxes that predict the correct class
of objects and that have an IoU with the ground truth that is greater than a certain
threshold. A False Positive refers to the bounding boxes that predicted the class
incorrectly or have an overlap that is less than the defined threshold with the
ground truth. Furthermore, if there are multiple bounding boxes that are identified
for the same ground truth bounding box, only one box can get into a true positive,
and everything else gets into a false positive.
Precision is the ability of a model to identify only the relevant objects.
i.e. R = TP / (TP + FN).
R = TP / Total (positive) Ground-Truth.
Recall measures how well you find all the (actual) positives, i.e. how many of the positive cases are identified.
Recall is the ability of a model to find all the relevant (ground truth) objects.
High scores for both show that the classifier is returning accurate results
(high precision), as well as returning a majority of all positive results (high recall).
A system with high recall but low precision returns many results, but most
of its predicted labels are incorrect when compared to the training labels. A
system with high precision but low recall is just the opposite, returning very few
results, but most of its predicted labels are correct when compared to the training
labels. An ideal system with high precision and high recall will return many
results, with all/most results labeled correctly.
Precision and Recall are the two most common metrics that take into
account class imbalance (when you observe more data points of one class than
of another).
These quantities are also related to the (F1) score (evaluation metric that
measures a model's accuracy), which is defined as the harmonic mean of
precision and recall.
F1 = 2 * (P * R) / (P + R)
The harmonic mean is used (rather than the more common arithmetic mean) to handle any potential imbalance between precision and recall, because it punishes extreme values: the harmonic mean stays close to the smallest of its inputs, minimizing the impact of large outliers and maximizing the impact of small ones.
Ex: A classifier with a precision of 1.0 and a recall of 0.0 has a simple
average of 0.5 but an F1 score of 0.
Since the F1 score is an average of Precision and Recall, it means that the
F1 score gives equal weight to Precision and Recall:
- A model will obtain a high F1 score if both Precision and Recall are high.
- A model will obtain a low F1 score if both Precision and Recall are low.
- A model will obtain a medium F1 score if one of Precision and Recall is low
and the other is high.
F1 score ranges from 0 to 1.
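A quick numeric sketch of the definitions above (the counts are made up for illustration):
TP, FP, FN = 80, 20, 40
precision = TP / (TP + FP)                            # 0.8
recall = TP / (TP + FN)                               # ~0.667
f1 = 2 * (precision * recall) / (precision + recall)  # ~0.727: between the two, closer to the smaller one.
print(precision, recall, f1)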
For detection, a common way to determine whether an object proposal was right is Intersection over Union (IoU). Commonly, IoU > 0.5 means it was a hit, otherwise it was a fail. If one wants better proposals, one increases the IoU threshold from 0.5 to a higher value (up to 1.0, which would be perfect). One can denote this with mAP@p, where p ∈ (0,1) is the IoU threshold.
Ex: mAP@0.5 = 0.98 means mAP with IoU=0.5 (50% overlap) has an
accuracy of 0.98 (98%).
Object detection can be done using several model architectures, such as R-CNN
(Region-based CNN), Fast R-CNN, Faster R-CNN, YOLO, SSD, combining CV
with NLP using transformers (using positional embedding to identify regions
containing the object) such as DETR (Detection Transformer), etc.
R-CNN:
R-CNN assists in identifying both the objects present in the image and the
location of objects within the image.
(1) Extract region proposals from an image: Ensure that we extract a high
number of proposals to not miss out on any potential object within the image.
(2) Resize (warp) all the extracted regions to get images of the same size.
(3) Pass the resized region proposals through a network: Typically, we pass
the resized region proposals through a pretrained model such as VGG16 or
ResNet50 and extract the features in a fully connected layer. We can also use MobileNet(v2) instead of VGG16 for the feature maps; it is smaller and faster while giving similar accuracy.
(4) Create data for model training, where the input is features extracted by
passing the region proposals through a pretrained model, and the outputs are the
class corresponding to each region proposal and the offset of the region proposal
(RP) from the ground truth corresponding to the image:
If a region proposal has an IoU greater than a certain threshold with the
object, we prepare training data in such a way that the concerned region is
responsible for predicting the class of object it is overlapping with and also the
offset of the region proposal with the ground truth bounding box that contains the
object of interest (computed as {BB_coordinates - RP_coordinates}).
We calculate the offset between the region proposal bounding box and the
ground truth bounding box as the difference between center coordinates of the
two bounding boxes (dx, dy) and the difference between the height and width of
the bounding boxes (dw, dh).
(5) Connect two output heads, one corresponding to the class of image and
the other corresponding to the offset of the region proposal with the ground truth
bounding box to extract the fine bounding box on the object (similar to Multi-Task
training - categorical class variable & continuous offset variable).
(6) Train the model, writing a custom loss function that minimizes both object
classification error and the bounding box offset error.
For the scenario of object detection, we will download the data from the Google
Open Images v6 dataset (available at
https://www.kaggle.com/datasets/sixhky/open-images-bus-trucks). However, in
code, we will work on only those images that are of a bus or a truck.
Object detection includes defining the functions/operations for:
(a) region proposal extraction
(b) IoU calculation
Illustration of populating IoU for candidates/RP for BBs(in case a single image
contains multiple BB for multiple objects/labels):
We then find the best IoU for that candidate/RP, which in turn gives the best BB
corresponding to that IoU - i.e. which ground truth BB this candidate best
corresponds to.
Then, if this best IoU is above a threshold, we assign the label for this RP as the
label for the corresponding ground truth BB; else we assign the label as
background.
Dataset:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import cv2, pandas as pd
from CommonObjDetectionFunctions import generate_region_proposals, compute_iou,
assign_classes_via_IoU, compute_BB_offsets
from torchvision.ops import nms # for non-maximum suppression.
class BusTruckDataset(Dataset):
resize_value = 224
labels = ["Background", "Bus", "Truck"]
preprocess = transforms.Compose([transforms.ToTensor(),
transforms.Resize(resize_value), transforms.CenterCrop(resize_value),
transforms.ConvertImageDtype(torch.float32), transforms.Normalize(mean=[0.485, 0.456,
0.406], std=[0.229, 0.224, 0.225])])
def __len__(self):
return len(self.data)
# flattens data into a format suitable for usage in __getitem__(). Note that if the label is 0/background, the delta is all zeros.
def flatten_data(self, IMG_PATHS, RP_LOCATIONS, LABELS, DELTAS):
self.data = []
for sub_list_index, candidates in enumerate(RP_LOCATIONS):
image_name = IMG_PATHS[sub_list_index] # get image name.
for element_index, rp in enumerate(candidates):
# get the label for this RP.
labels_sublist = LABELS[sub_list_index]
label = labels_sublist[element_index]
# get delta for this RP.
deltas_sublist = DELTAS[sub_list_index]
delta = deltas_sublist[element_index]
# accumulate image name, RP, label, delta.
self.data.append((image_name, rp, label, delta))
#print(len(self.data)) # this should be the same as self.DATA_LENGTH (used for testing only).
# prepares data to be consumed by this Dataset. prepare_data() is somewhat specific to the data format provided in the csv dataset file.
def prepare_data(self, input_csv_file, Rows, train):
IMG_PATHS = [] # holds image names from the dataset.
RP_LOCATIONS = [] # for EACH entry in 'IMG_PATHS', holds its (list of) RP locations. It's a list of lists. The values are not normalized, so the raw values can be used to fetch the sub-section (RP) after loading the image.
LABELS = [] # for each entry in 'RP_LOCATIONS', holds its class values (list of lists).
DELTAS = [] # for each entry in 'RP_LOCATIONS', holds the offsets from the ground truth BB for that particular RP.
if train:
# for training, use first 'Rows'.
df = pd.read_csv(input_csv_file, nrows=Rows)
else:
# for validation, use last 'Rows'.
df = pd.read_csv(input_csv_file)
df = df.tail(Rows)
# Start extracting data & presenting it in desired form.
for _, row in df.iterrows():
    # generate RPs for this image, and record all of them along with their classes (or background).
    img = cv2.imread(self.image_path + row["ImageID"] + ".jpg")
    candidates, normalized_candidates = generate_region_proposals(img)
    bb = row["XMin"], row["YMin"], row["XMax"], row["YMax"] # ground truth (normalized) BB.
    # compute the IoU for each candidate of the current image 'row["ImageID"]'.
    curr_image_ious = [compute_iou(c, bb) for c in normalized_candidates]
    curr_RP_classes = [assign_classes_via_IoU(iou, row["LabelName"], BusTruckDataset.labels) for iou in curr_image_ious]
    IMG_PATHS.append(row["ImageID"])
    RP_LOCATIONS.append(candidates) # un-normalized RP locations, for directly cropping the sub-image (RP) from the loaded input image.
    LABELS.append(curr_RP_classes)
    # compute offsets from the ground truth (bb) for each (normalized) candidate, only where the assigned class is not background.
    curr_deltas = compute_BB_offsets(normalized_candidates, bb, curr_RP_classes)
    DELTAS.append(curr_deltas)
# flatten the data for further usage.
self.flatten_data(IMG_PATHS, RP_LOCATIONS, LABELS, DELTAS)
# Testing - calculate the length of data - use LABELS, as its element type is the simplest (int).
#self.DATA_LENGTH = sum([len(sublist) for sublist in self.LABELS])
#print(f"Preparing Data Complete.")
# gets the best detections, ordered by highest probability first, in the returned lists.
def get_best_detections(input_image, normalized_candidates, class_predictions_for_RPs, bb_prediction):
    # (1) get the probabilities & scores to be used later, in nms.
    class_probabilities = torch.nn.functional.softmax(class_predictions_for_RPs, -1)
    # torch.max() returns the max values of each row in the given dimension, & their indices (converting probabilities to classes).
    class_prob_scores, class_predictions_for_RPs = torch.max(class_probabilities, -1)
    # (2) extract the predictions, their RPs & BB predictions that do not belong to the background class.
    class_predictions_indices = torch.where(class_predictions_for_RPs != 0)
    class_predictions_for_RPs = class_predictions_for_RPs[class_predictions_indices]
    normalized_candidates = torch.tensor(normalized_candidates[class_predictions_indices])
    class_prob_scores = class_prob_scores[class_predictions_indices].detach()
    bb_prediction = bb_prediction[class_predictions_indices].detach()
best_class_prob_scores.append(class_prob_scores[best_index])
return best_class_predictions, best_bbs, best_class_prob_scores
CommonObjDetectionFunctions:
# this file contains functionalities common to Objection Detection, that can be reused
again.
import numpy as np
import selectivesearch
# generates & returns the region proposals (in both raw and normalized form) for a given image.
def generate_region_proposals(img, pScale=200, pMin_size=100):
    # selective_search() requires a numpy array as input.
    img_label, regions = selectivesearch.selective_search(np.array(img), scale=pScale, min_size=pMin_size)
    img_area = img.shape[0] * img.shape[1] # area = height * width.
    RP_candidates = [] # holds region proposal candidates.
    for r in regions:
        if r["rect"] in RP_candidates: # ignore, if the RP (tuple form - x,y,w,h) is a duplicate.
            continue
        if r["size"] < (0.05 * img_area): # ignore, if the region is too small.
            continue
        if r["size"] > (1 * img_area): # ignore, if the region is too large.
            continue
        # add this rect (RP) to the candidates list.
        RP_candidates.append(r["rect"])
    # convert x,y,w,h to x1,y1,x2,y2 i.e. points format.
    RP_candidates = [(x, y, x + w, y + h) for (x, y, w, h) in RP_candidates]
    # normalize candidate values from pixel coordinates to the 0-1 range. Note that the ground truth BB is already normalized; having both the BB & the candidates in the same range is required when calculating IoU.
    width = img.shape[1]
    height = img.shape[0]
    RP_candidates_normalized = RP_candidates / np.array([width, height, width, height])
    return RP_candidates, np.float32(RP_candidates_normalized)
# This function computes & returns the IoU between 2 bounding boxes bb1 & bb2. Returns 0 if they do not overlap.
def compute_iou(bb1, bb2, epsilon=1e-5):
# get sub rect that's overlapping (if overlap exists).
x1 = max(bb1[0], bb2[0])
y1 = max(bb1[1], bb2[1])
x2 = min(bb1[2], bb2[2])
y2 = min(bb1[3], bb2[3])
width = x2 - x1
height = y2 - y1
# If no overlap, return 0.
if(width < 0) or (height < 0):
return 0
area_overlap = width * height # get area of overlap (Intersection).
# compute individual areas.
area_1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
area_2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
# get total area (union).
area_combined = area_1 + area_2 - area_overlap
iou = area_overlap / (area_combined + epsilon) # get IoU.
return np.float32(iou)
# this function determines the class to be assigned to each RP, by checking its IoU against a threshold. The returned value is the class index.
def assign_classes_via_IoU(iou, label_class, labels):
THRESHOLD = 0.3
if iou > THRESHOLD:
return labels.index(label_class) # assign class from database.
else:
return 0 # return Background class index.
# this function calculates and returns the difference between the candidate BB & the ground truth BB, for all candidates. If the assigned class is of type background, the delta is stored as zeros.
def compute_BB_offsets(normalized_candidates, bb, curr_RP_classes):
deltas = [] # contains the results.
for index, candidate in enumerate(normalized_candidates):
if curr_RP_classes[index] == 0:
    deltas.append([0, 0, 0, 0]) # assign zeros if the assigned class is background.
else:
    deltas.append(bb - candidate) # compute the difference between BB & RP (in that order) & store it as the offset.
return deltas
ObjDetectModel:
import torch
from torchvision import models
import torch.nn as nn
from ObjDetectSubModule import ObjDetectSubModule
class ObjDetectModel(nn.Module):
#model_name = "trained_models/ObjDetectModel_VGG16.pth"
def __init__(self) -> None:
super().__init__()
self.model_name = "trained_models/ObjDetectModel_VGG16.pth"
self.model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# disable learning on all layers.
for param in self.model.parameters():
param.requires_grad = False
#summary(model, torch.zeros(1, 3, 224, 224))
# override layers. Input to model.avgpool = (512,7,7)
#self.model.avgpool = nn.Sequential(nn.Conv2d(512, 512, kernel_size=3), nn.MaxPool2d(2), nn.ReLU(), nn.Flatten()) # input feature map = 7x7. Conv2d: 7-2=5, MaxPool2d: 5//2=2, so output = 512*2*2 = 2048.
# "self.model.avgpool" is not modified above, so that the classifier below receives a larger number of input nodes (~25k), which might result in better BB prediction, as per the book example.
#self.model.classifier = ObjDetectSubModule(2048)
class_predictions, bb_predictions = self.model(input)
return class_predictions, bb_predictions
ObjDetectSubModule:
import torch
import torch.nn as nn
from BusTruckDataset import BusTruckDataset
class ObjDetectSubModule(nn.Module):
def __init__(self, features_dim) -> None:
super().__init__()
# label classifier: output nodes = number of labels.
#self.label_classifier = nn.Sequential(nn.Linear(features_dim, len(BusTruckDataset.labels)), nn.Sigmoid())
self.label_classifier = nn.Sequential(nn.Linear(features_dim, 6272), nn.ReLU(), nn.Linear(6272, 1568), nn.ReLU(), nn.BatchNorm1d(1568), nn.Linear(1568, 392), nn.ReLU(), nn.Linear(392, len(BusTruckDataset.labels)), nn.Sigmoid()) # 6272/4=1568, 1568/4=392.
Train_validate:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
import math
# This custom loss function penalizes more for predictions where actual classes are not predicted correctly (or predicted as background), & less where actual background classes are predicted as non-background classes.
def custom_classification_loss_fn(class_preds, class_labels, classification_lossFn):
# 'background_loss' - when background is wrongly predicted as a non-background class.
class_indices = torch.where(class_labels == 0)
# get predictions for indices where the labels are background.
class_preds_background = class_preds[class_indices]
class_labels_background = class_labels[class_indices]
# compute the loss for these background-labeled samples.
background_loss = classification_lossFn(class_preds_background, class_labels_background)
# accumulated losses over all batches in an epoch.
acc_classification_loss = 0
acc_bb_loss = 0
acc_total_loss = 0
num_iterations = 0
for _, [input_images, class_labels, gt_bb_offsets] in enumerate(dLoader):
optimFn.zero_grad()
class_predictions, bb_predictions = model(input_images) # "input_images" are the proposed regions (RPs).
{(acc_bb_loss/num_iterations):0.5f} , Total Loss =
{(acc_total_loss/num_iterations):0.5f}")
class_indices = torch.where(class_labels != 0)
bb_predictions = bb_predictions[class_indices]
gt_bb_offsets = gt_bb_offsets[class_indices]
bb_loss = BB_lossFn(bb_predictions, gt_bb_offsets)
total_bb_loss += bb_loss_value
total_iterations += 1
print(f"Validation per Batch: Classification Accuracy:(Background Accuracy:
{background_accuracy_ratio:0.5f}, Correct Class Accuracy:
{correct_class_accuracy_ratio:0.5f}, Total Accuracy: {total_accuracy_ratio:0.5f}) , BB
Detection Loss = {bb_loss_value:0.5f}")
print("\n")
print(f"Average: Classification Accuracy =
{total_accuracy_ratio/total_iterations} , BB Detection Loss =
{total_bb_loss/total_iterations}")
print("\n")
# computes the accuracy for categorical variable/field i.e. classification.
def compute_accuracy(predictions, labels):
total = labels.shape[0] # basically batch_size (rows)
predictions = torch.argmax(predictions, dim=1)
Main:
import torch
import torch.nn as nn
from BusTruckDataset import BusTruckDataset
from torch.utils.data import DataLoader
from ObjDetectModel import ObjDetectModel
from train_validate import train, validate
import torch.optim as optim
import numpy as np
import cv2
from torchvision import transforms
# link: https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/blob/master/Chapter07/Training_RCNN.ipynb
ROWS = 1000 # number of rows to use from the dataset, since the dataset is huge & generating region proposals adds significant time even for a small number of samples.
def run_training():
epochs = 5 #20
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, shuffle=True, drop_last=True)
model = ObjDetectModel().to(device)
label_lossFn = nn.CrossEntropyLoss()
delta_lossFn = nn.MSELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Started-----------------")
train(epochs, model, dLoader, label_lossFn, delta_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
def run_validation():
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", int(ROWS/5),
False)
dLoader = DataLoader(dataset, batch_size=100, shuffle=False, drop_last=True)
model = ObjDetectModel().to(device)
result = model.load_state_dict(torch.load(model.model_name))
delta_lossFn = nn.MSELoss()
print("\n")
print("------------Validation Started-----------------")
validate(model, dLoader, delta_lossFn)
print("------------Validation Complete-----------------")
print("\n")
# to resume training.
def resume_training(epochs = 20):
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, shuffle=True, drop_last=True)
model = ObjDetectModel().to(device)
# load previously trained model
result = model.load_state_dict(torch.load(model.model_name))
label_lossFn = nn.CrossEntropyLoss()
delta_lossFn = nn.MSELoss()
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Resumed-----------------")
train(epochs, model, dLoader, label_lossFn, delta_lossFn, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs,
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
return tensor_img.numpy().astype(np.uint8) # format = HxWxC (permute order 1,2,0), as required by cv2 images.
##########
cv2.imshow("window1", input_image)
cv2.waitKey(0)
print("Prediction completed.")
# draw image.
img = cv2.rectangle(img, startPoint, endPoint, color, 1)
# display both the label & its predicted probability/score above the BB in the image.
if predicted_label_score == 1: # for the ground truth BB, do not show the score.
    label_description = best_predicted_label
else:
    label_description = best_predicted_label + f" ({predicted_label_score:0.3})"
img = cv2.putText(img, label_description, (startX, startY-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
return img
#run_training()
#resume_training(10)
#run_validation()
# images up to row 1000 are already in training. Detection performs well on training images.
#test_on_input("../dataset/df.csv", 500)
test_on_input("../dataset/df.csv", 600)
In Fast-RCNN, instead of feeding the region proposals to the CNN (as in RCNN),
we:
(a) feed the input image to a CNN (a pretrained model) to generate a
convolutional feature map.
(b) From the convolutional feature map, we identify the region proposals (via selectivesearch) and use a RoI (Region of Interest) pooling layer to reshape each of them into a fixed size, so that they can be fed into a fully connected layer. This replaces the explicit warping that was performed in the R-CNN technique.
(c) From the RoI feature vector, we use a softmax layer to predict the class of
the proposed region and also the offset values for the bounding box.
The reason Fast R-CNN is faster than R-CNN is that we don't have to feed the N region proposals through the convolutional neural network one at a time. It takes the whole image (and the region proposals) as input and runs the CNN in a single forward pass. The convolution operation is thus done only once per image to produce a feature map; the region proposals are then cut out of this single feature map, thereby avoiding the need to pass each resized RP through the convolutional layers separately.
When you look at the performance of Fast R-CNN during testing time,
including region proposals slows down the algorithm significantly when compared
to not using region proposals. Therefore, region proposals become bottlenecks in
the Fast R-CNN algorithm affecting its performance.
R-CNN vs Fast-RCNN.
Faster R-CNN:
Both of the above algorithms (R-CNN & Fast R-CNN) use selective search to find
out the region proposals. Selective search is a slow and time-consuming process
affecting the performance of the network.
Faster RCNN eliminates the selective search algorithm and lets the
network learn the region proposals. Similar to Fast R-CNN, the image is provided
as an input to a convolutional network which provides a convolutional feature
map. Instead of using selectivesearch algorithm on the feature map to identify
the region proposals, a separate network (RPN - Region Proposal Network) is
used to predict the region proposals. The predicted region proposals are then
reshaped using a RoI pooling layer which is then used to classify the image
within the proposed region and predict the offset values for the bounding boxes.
Faster RCNN is much faster than its predecessors, hence it can even be used for real-time object detection.
Anchor Boxes:
Anchor Boxes are a handy replacement for selectivesearch that was used to
compute RPs.
When we have a decent idea of the width, height & aspect ratio
(height/width) of the objects in our dataset (from the provided ground truth BB),
we define the anchor boxes with heights and widths representing the majority of
object’s bounding boxes within our dataset (can be obtained by employing K-
means clustering on top of the ground truth bounding boxes).
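A hedged sketch of that clustering step, using scikit-learn's KMeans over ground-truth widths and heights (the data here is purely illustrative):
import numpy as np
from sklearn.cluster import KMeans

# ground-truth box sizes as (width, height) pairs, e.g. normalized to the image size.
wh = np.array([[0.10, 0.30], [0.12, 0.28], [0.40, 0.20], [0.38, 0.22], [0.25, 0.25]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(wh)
anchor_sizes = kmeans.cluster_centers_   # each row is a representative (width, height) anchor.
print(anchor_sizes)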
Usage:
(a) Slide each anchor box over an image from top left to bottom right.
(b) The anchor box that has a high (above a threshold) intersection over union
(IoU) with the object will have a label that mentions that it contains an object, and
the others will be labeled 0.
Once we obtain the ground truths as defined here, we can build a model that can
predict the location of an object and also the offset corresponding to the anchor
box to match it with ground truth.
For each stride/slide of an anchor box in the image, we feed the image
crop (crop a sub-image at, & equal to anchor box) to the RPN, indicating whether
the crop contains an object.
Essentially, an RPN suggests the likelihood of a crop containing an object.
We take each region candidate and compare with the ground truth bounding
boxes of objects in an image to identify whether the IoU between a region
candidate and a ground truth bounding box is greater than a certain threshold. If
the IoU is greater than a certain threshold (say, 0.5), the region candidate
contains an object, and if the IoU is less than a threshold (say 0.1), the region
candidate does not contain an object and all the candidates that have an IoU
between the two thresholds (0.1 - 0.5) are ignored while training.
Once we train a model to predict if the region candidate contains an
object, we then perform non-max suppression, as multiple overlapping regions
can contain an object.
Below is a sketch that shows how a 3x3 sliding window (red colour) of the RPN is applied at some location (blue dot) of a feature map with 512 channels. At each such location, the RPN considers k anchor boxes of fixed scales and aspect ratios. The regression head of the RPN outputs 4 values (tx, ty, tw, th) for each anchor box, which are then used to resize and move the center of that anchor box to get a region proposal (together with an objectness score obtained by the classification branch (softmax classifier) of the RPN).
RPNs use anchor boxes that serve as references at multiple scales and
aspect ratios. The scheme can be thought of as a pyramid of regression
references, which avoids enumerating images or filters of multiple scales or
aspect ratios.
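For reference, the standard Faster R-CNN parameterization decodes the four regression outputs into a proposal roughly as follows (a sketch, with the anchor given in center/size form; names are illustrative):
import math

def decode_anchor(tx, ty, tw, th, xa, ya, wa, ha):
    # (xa, ya, wa, ha): anchor center and size; (tx, ty, tw, th): RPN regression outputs.
    x = xa + tx * wa          # shift the center proportionally to the anchor size.
    y = ya + ty * ha
    w = wa * math.exp(tw)     # scale the width/height multiplicatively.
    h = ha * math.exp(th)
    return x, y, w, h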
For every region of interest from the input list, it takes a section of the input
feature map that corresponds to it and scales it to some predefined (output) size
(e.g., 7×7).
Thus, the output buffer for each RP (always of the same size) will contain
the max values in that section.
The major hurdle for going from image classification to object detection is fixed
size input requirement to the network because of existing fully connected layers.
In object detection, each proposal will be of a different shape. So there is a need
for converting all the proposals to fixed shape as required by fully connected
layers. ROI Pooling does exactly this.
The result of RoIPooling is that; from a list of rectangles with different sizes, we
can quickly get a list of corresponding feature maps with a fixed size. Note that
the dimension of the RoI pooling output doesn’t actually depend on the size of
the input feature map nor on the size of the region proposals. It’s determined
solely by the number of sections we divide the proposal into.
What’s the benefit of RoI pooling? One of them is processing speed. If there are
multiple object proposals on the frame (and usually there’ll be a lot of them), we
can still use the same input feature map for all of them. Since computing the
convolutions at early stages of processing is very expensive, this approach can
save us a lot of time.
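torchvision exposes this operation directly as torchvision.ops.roi_pool; a minimal sketch with a dummy feature map:
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 50)              # [batch, channels, H, W] from the backbone.
# RoIs in (batch_index, x1, y1, x2, y2) format, in the feature map's coordinate space.
rois = torch.tensor([[0., 5., 5., 30., 40.],
                     [0., 10., 10., 45., 20.]])
pooled = roi_pool(feature_map, rois, output_size=(7, 7))
print(pooled.shape)   # torch.Size([2, 512, 7, 7]) - fixed size regardless of each RoI's shape.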
What are the most important things to remember about RoI Pooling?
- RoIHeads takes the preceding maps, aligns them using RoI pooling, processes
them, and returns classification probabilities for each proposal and the
corresponding offsets.
Dataset:
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import cv2, pandas as pd
import numpy as np
from torchvision.ops import nms # for non-maximum suppression.
class BusTruckDataset(Dataset):
resize_value = 224
labels = [] #["Background", "Bus", "Truck"]
preprocess = transforms.Compose([transforms.ToTensor(),
transforms.Resize(resize_value), transforms.CenterCrop(resize_value),
transforms.ConvertImageDtype(torch.float32)])
def __len__(self):
return len(self.data)
def __getitem__(self, index):
image_name, target = self.data[index]
# read the entire image & preprocess it.
img = cv2.imread(self.image_path + image_name + ".jpg") # read the entire image.
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)/255 # reorder channels & convert values to the range [0,1].
img = BusTruckDataset.preprocess(img) # perform preprocessing.
# return image.
return img.to(device), target
df = df.tail(Rows)
# get output label classes.
BusTruckDataset.labels = BusTruckDataset.get_output_classes(df)
# get the unique images, which we will later iterate over to get all entries for each unique image.
self.unique_images = df["ImageID"].unique()
def load_for_testing(input_filename, row_number):
# (1) load & preprocess input.
df = pd.read_csv(input_filename)
BusTruckDataset.labels = BusTruckDataset.get_output_classes(df)
row = df.iloc[row_number-1]
input_image = cv2.imread("../dataset/images/" + str(row["ImageID"]) + ".jpg")
width = input_image.shape[1]
height = input_image.shape[0]
input_image = cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)/255
input_image = BusTruckDataset.preprocess(input_image)
gt_bb = torch.tensor([row["XMin"], row["YMin"], row["XMax"], row["YMax"]])
gt_bb *= np.array([width, height, width, height])
return input_image, row["LabelName"], gt_bb
# The output of the trained Faster-RCNN model contains boxes, labels, and scores corresponding to classes.
def decode_outputs(outputs):
# extract bounding boxes from outputs structure.
outputs = outputs[0]
bbs = outputs["boxes"].cpu().detach()
# extract label names.
labels = outputs["labels"].cpu()
# extract confidence scores.
scores = outputs["scores"].cpu().detach()
nms_indices = nms(bbs, scores, 0.05)
# retain only those values that correspond to nms_indices.
bbs = bbs[nms_indices]
labels = labels[nms_indices]
scores = scores[nms_indices]
return bbs.tolist(), labels.tolist(), scores.tolist()
Model:
import torch
from torchvision import models
import torch.nn as nn
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor # FRCNN model - provides a model that has been pre-trained on the COCO dataset using a ResNet50 backbone.
# https://pytorch.org/vision/main/models/generated/torchvision.models.detection.fasterrcnn_resnet50_fpn.html
"""
Faster-RCNN contains the following submodules:
- GeneralizedRCNNTransform is a simple resize followed by a normalize transformation.
GeneralizedRCNNTransform(
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
Resize(min_size=(800,), max_size=1333, mode='bilinear')
)
- BackboneWithFPN is a neural network that transforms input into a feature map.
- RegionProposalNetwork generates the anchor boxes for the preceding feature map and
predicts individual feature maps for classification and regression tasks.
RegionProposalNetwork(
AnchorGenerator()
RPNHead(
conv(): (Conv2D(256, 256, kernel_size(3,3)), stride=(1,1), padding=(1,1))
cls_logits(): (Conv2D(256, 3, kernel_size(1,1)), stride=(1,1))
bbox_pred(): (Conv2D(256, 12, kernel_size(1,1)), stride=(1,1))
)
)
- RoIHeads takes the preceding maps, aligns them using RoI pooling, processes them,
and returns classification probabilities for each proposal and the corresponding
offsets.
RoIHeads(
    (box_roi_pool): MultiScaleRoIAlign()
    (box_head): TwoMLPHead(
        (fc6): Linear(in_features=12544, out_features=1024, bias=True)
        (fc7): Linear(in_features=1024, out_features=1024, bias=True)
    )
    (box_predictor): FastRCNNPredictor(
        (cls_score): Linear(in_features=1024, out_features=2, bias=True)
        (bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
    )
)
"""
class FRCNNModel(nn.Module):
def __init__(self, num_classes) -> None:
super().__init__()
self.model_name = "trained_models/ObjDetectModel_FRCNN.pth"
self.model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
# disable learning on all layers.
#summary(model, torch.zeros(1, 3, 224, 224))
in_features = self.model.roi_heads.box_predictor.cls_score.in_features
# The predictor is what outputs the classes and the corresponding bboxes.
# num_classes = number of classes + 1 (for background).
self.model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
train-validate:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
import math
total_loc_loss = 0; total_regr_loss = 0; total_loss_objectness = 0;
total_loss_rpn_box_reg = 0
total_loss = 0
for _, [input_images, targets] in enumerate(dLoader):
optimFn.zero_grad()
losses = model(input_images, targets) # 'losses' is a dictionary of losses.
# compute the total loss. We can also customize how the individual losses are weighted.
loss = sum(loss for loss in losses.values())
loss.backward()
optimFn.step()
num_iterations += 1
loc_loss, regr_loss, loss_objectness, loss_rpn_box_reg = [losses[k] for k in ['loss_classifier','loss_box_reg','loss_objectness','loss_rpn_box_reg']]
# accumulate individual losses over iterations.
total_loc_loss += loc_loss.item()
total_regr_loss += regr_loss.item()
total_loss_objectness += loss_objectness.item()
total_loss_rpn_box_reg += loss_rpn_box_reg.item()
total_loss += loss.item()
print(f"Epoch:{epoch+1} - Training: Location Loss =
{(total_loc_loss/num_iterations):0.5f}, Regression Loss =
{total_regr_loss/num_iterations:0.5f}, Objectness Loss =
{total_loss_objectness/num_iterations:0.5f}, RPN box Regression Loss =
{total_loss_rpn_box_reg/num_iterations:0.5f}, Total loss =
{total_loss/num_iterations:0.5f}")
loc_loss, regr_loss, loss_objectness, loss_rpn_box_reg = [losses[k] for k in ['loss_classifier','loss_box_reg','loss_objectness','loss_rpn_box_reg']]
#total_iterations += 1
print(f"Validation per Batch Loss: Location Loss =
{(loc_loss.item()):0.5f}, Regression Loss = {regr_loss.item():0.5f}, Objectness Loss =
{loss_objectness.item():0.5f}, RPN box Regression Loss =
{loss_rpn_box_reg.item():0.5f}, Total loss = {loss.item():0.5f}")
print("\n")
main:
import torch
from BusTruckDataset import BusTruckDataset
from torch.utils.data import DataLoader
from FRCNNModel import FRCNNModel
from train_validate import train, validate
import torch.optim as optim
import numpy as np
import cv2
# link: https://github.com/PacktPublishing/Modern-Computer-Vision-with-PyTorch/blob/master/Chapter08/Training_Faster_RCNN.ipynb
ROWS = 3000 # number of rows to use from the dataset, since the dataset is huge & generating region proposals adds significant time even for a small number of samples.
print("------------Training Started-----------------")
train(epochs, model, dLoader, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
def run_validation():
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", int(ROWS/5),
False)
dLoader = DataLoader(dataset, batch_size=100, collate_fn=dataset.collate_fn,
shuffle=False, drop_last=True)
model = FRCNNModel(len(dataset.labels)).to(device)
optimFn = optim.Adam(model.parameters(), lr=0.0005)
result = model.load_state_dict(torch.load(model.model_name))
print("\n")
print("------------Validation Started-----------------")
validate(model, dLoader, optimFn)
print("------------Validation Complete-----------------")
print("\n")
# to resume training.
def resume_training(epochs = 20):
dataset = BusTruckDataset("../dataset/images/", "../dataset/df.csv", ROWS, True)
# drop_last=True drops the last non-full batch of the dataset.
dLoader = DataLoader(dataset, batch_size=100, collate_fn=dataset.collate_fn,
shuffle=True, drop_last=True)
model = FRCNNModel(len(dataset.labels)).to(device)
# load previously trained model
result = model.load_state_dict(torch.load(model.model_name))
optimFn = optim.Adam(model.parameters(), lr=0.0005)
try:
print("\n")
print("------------Training Resumed-----------------")
train(epochs, model, dLoader, optimFn)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs,
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
else:
print("------------Training Complete-----------------")
torch.save(model.state_dict(), model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
input_image = convert_tensor_image_to_opencv(input_image)
cv2.imshow("window1", input_image)
cv2.waitKey(0)
print("Prediction completed.")
# draw image.
img = cv2.rectangle(img, startPoint, endPoint, color, 1)
# display both the label & its predicted probability/score above the BB in the image.
if predicted_label_score == 1: # for the ground truth BB, do not show the score.
    label_description = best_predicted_label
else:
    label_description = best_predicted_label + f" ({predicted_label_score:0.3})"
img = cv2.putText(img, label_description, (startX, startY-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
return img
#run_training(10)
#resume_training(10)
#run_validation()
# images up to row 1000 are already in training. Detection performs well on training images.
#test_on_input("../dataset/df.csv", 500)
#test_on_input("../dataset/df.csv", 600)
1) Faster R-CNN works on the concept of sliding anchors over the image, so it is possible that the generated regions do not fully encompass the object, and the model has to guess the real BBs.
YOLO, in contrast, looks at the whole image while predicting the BB.
Faster R-CNN performs detection on various region proposals and thus ends up making predictions multiple times over various regions of an image; the YOLO architecture, on the other hand, is more like a fully convolutional neural network (FCNN): the image passes through the FCNN once and the output gives the predictions.
2) Faster R-CNN is still slower, as we have 2 networks: RPN & final network for
prediction.
YOLO has a single network to look at the whole image at once, & make
predictions in real time.
Faster R-CNN offers region of interest (RoIs) to perform convolution
on it while YOLO does detection and classification at the same time.
(1) Residual Blocks: Divide input image into NxN grids (say N=3) of equal
size (For YOLOv8, default grid size seems to be (32,32) - see source code in
ultralytics/yolo/engine/trainer.py).
YOLO performance depends on grid size.
(2) Bounding Box Regression:
(a) Identify those grid cells that contain the center of at least one ground
truth bounding box. In our case, they are cells b1 and b3 of our 3 x 3 grid image.
The cell(s) where the middle point of the ground truth bounding box falls
is/are responsible for predicting the bounding box of the object.
(b) The output ground truth corresponding to a cell is as
follows(considering there are 3 classes - c1=truck, c2=car, c3=bus):
pc - (the objectness score) is the probability of the cell containing an object (any
of the classes).
Background class is not needed here, as pc=0 means the cell does not
contain any of the objects/classes from training data, so the rest of the data don't
make any sense.
bx, by - the location (offset) of the midpoint of the ground truth bounding box with
respect to the grid cell origin ((0.5, 0.5) in above grid cell image; as the midpoint
of the ground truth is at a distance of 0.5 units from the origin, from both X & Y
axis).
bw, bh - bw is the ratio of the width of the bounding box with respect to the width
of the grid cell (bw = bbW / gW in below image for cell b3). bh is the ratio of the
height of the bounding box with respect to the height of the grid cell (bh = bbH /
gH).
For example, with the class in cell b1 being car, c2 is set to 1 while c1 and c3 are 0.
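Pulling points (2a)-(2b) together, the ground-truth vector for one cell (with 3 classes and a single box, as in the notes) could be assembled roughly like this; the numbers are purely illustrative:
import torch

# [pc, bx, by, bw, bh, c1, c2, c3] for one grid cell.
# pc=1: the cell contains an object's midpoint; (bx, by): midpoint offset within the cell;
# (bw, bh): box size relative to the cell; one-hot class entries (here class c2 = car).
cell_target = torch.tensor([1.0, 0.5, 0.5, 2.1, 1.4, 0.0, 1.0, 0.0])
# A cell with no object midpoint only needs pc=0; the remaining values are ignored by the loss.
empty_cell = torch.tensor([0.0, 0, 0, 0, 0, 0, 0, 0])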
(3) IoU: a single object in an image can have multiple grid box
candidates for prediction (see anchors below), even though not all of them are
relevant. The goal of the IOU is to discard such grid boxes to only keep those
that are relevant.
Training Stage:
During the training process, YOLO calculates IoU between the predicted bounding boxes and
the ground truth bounding boxes to determine the quality of the predictions. IoU is computed
by dividing the area of intersection between the predicted and ground truth bounding boxes by
the area of their union. This gives a measure of how well the predicted box aligns with the
actual object location.
If the IoU is above a certain threshold (usually around 0.5 or 0.6), the predicted bounding box is
considered a "true positive" match for the corresponding object. It means the algorithm
successfully detected the object accurately.
If the IoU is below the threshold, the predicted bounding box is treated as a "false positive" or a
"false negative" depending on whether there is a ground truth object in that location. This
provides information for the algorithm to improve its predictions.
Inference Stage:
When using a trained YOLO model for object detection on new images, the algorithm predicts
bounding boxes for potential objects (each grid’s output will have a prediction for each class
with some confidence score). These predictions are then filtered using a process called non-
maximum suppression (NMS), which is based on IoU.
For each class, the predicted bounding boxes are sorted by their confidence scores (the
probability that the predicted box contains an object of that class). Starting from the box with the
highest confidence, YOLO compares the IoU of this box with the IoU of all subsequent boxes
for the same class.
If the IoU between two (predicted) boxes is above a certain threshold, the box with the lower
confidence score is suppressed (removed). This prevents multiple bounding boxes from
being detected for the same object.
This process is repeated until all boxes have been compared and potentially suppressed.
By using IoU in both training and inference, YOLO ensures that it detects objects accurately
and produces a single, well-aligned bounding box for each object, even in cases where objects
might overlap or be close to each other.
(4) NMS: Setting a threshold for the IOU is not always enough because
an object can have multiple boxes with IOU beyond the threshold, and leaving all
those boxes might include noise. Here is where we can use NMS to keep only
the boxes with the highest probability score of detection.
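A rough sketch of this greedy, per-class NMS (it assumes the iou() helper from the earlier sketch):

# `detections` is a list of (confidence, box) pairs for one class.
def nms(detections, iou_threshold=0.5):
    detections = sorted(detections, key=lambda d: d[0], reverse=True)  # highest confidence first
    kept = []
    for conf, box in detections:
        # keep this box only if it does not overlap too much with an already-kept box
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((conf, box))
    return kept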
In the preceding example, the midpoint of the ground truth bounding boxes for
both the car and the person fall in the same cell – cell b1.
Anchor boxes come in handy in such a scenario. Let's say we have two anchor
boxes – one that has a greater height than width (corresponding to the person)
and another that has a greater width than height (corresponding to the car).
The output for each cell in a scenario where we have two anchor boxes is
represented as a concatenation of the output expected of the two anchor boxes:
Here, bx, by, bw, and bh represent the offsets with respect to the anchor box (which takes
the place of the grid cell in this scenario, as seen in the image). From the preceding
screenshot, we see that the output is 3 x 3 x 16, as we have two anchors. The expected
output is of shape N x N x (5 + num_classes) x num_anchor_boxes, where N x N is the
number of cells in the grid, 5 corresponds to (pc, bx, by, bw, bh), num_classes is the number
of classes in the dataset, and num_anchor_boxes is the number of anchor boxes
(here 3 x 3 x (5 + 3) x 2 = 3 x 3 x 16).
Architecture:
YOLO (v1) has overall 24 convolutional layers, four max-pooling layers, and two
fully connected layers.
YOLO v8:
Link: https://docs.ultralytics.com/#ultralytics-yolov8
We can also increase the number of grid cells to accommodate multiple detections that would
otherwise fall into a single (larger) grid cell, say for small objects that are very close to each
other.
(1) Finding suitable anchor boxes (in shape and size) is crucial in training an
excellent anchor-based object detection model. Finding suitable anchors is a
complex problem and may need hyper-parameter tuning.
(2) Using more anchors results in better accuracy in anchor-based object
detection but using more anchors comes at a cost. The model needs more
complex architecture, which leads to slower inference speed.
(3) Anchor-free object detection is more generalizable. It predicts objects as
points, which can easily be extended to key-point detection, 3D object detection,
etc. The anchor-based approach, in contrast, is limited to bounding box prediction.
YOLO expects the input dataset to have 2 sub-folders: “images” & “labels”.
“images” contains the (N) input images, and “labels” contains N (.txt) files, where each
line of a text file (having the same name as the input image it corresponds to)
describes one bounding box.
Ex:
The annotation file for the image above looks like the following:
There are 3 objects in total (2 persons and one tie) in the above input
image. Each line represents one of these objects. The specification for each line
is as follows.
- One row per object
- Each row is in "class x_center y_center width height" format (space separated).
- Box coordinates must be normalized by the dimensions of the image (i.e.
have values between 0 and 1). Note that the x_center, y_center, width, height values
are w.r.t. the input image's normalized coordinates.
- Class numbers are zero-indexed (start from 0).
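For illustration only (the original annotation file is not reproduced here), a label file for an image with two persons and one tie could look like the following, assuming a hypothetical dataset where person = class 0 and tie = class 1; the numbers are made up:

0 0.48 0.63 0.69 0.71
0 0.74 0.52 0.31 0.93
1 0.36 0.79 0.14 0.19

Each line is <class> <x_center> <y_center> <width> <height>, all normalized to the 0-1 range.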
All the above information is to be provided to the model during training, via a
single YAML file (say data.yaml), that has the following information in below
format:
train: <path to training dataset ex: ../../dataset/yolo_data/train/images>
val: <path to validation dataset ex: ../../dataset/yolo_data/val/images>
test: <path to testing data>
# number of classes
nc: <N>
# class names
names: <example:["Bus","Truck"]>
YOLO expects to find the training/validation labels for the images in the folder
whose name can be derived by replacing “images” with “labels” in the path to
dataset images.
For example, if the input training data path is { /dataset/train/images }, YOLO will
look for training labels in { /dataset/train/labels }.
Installation:
pip install ultralytics (installs everything needed for Yolo-v8)
Usage:
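A plausible prelude for the snippet below (the checkpoint name and image path are placeholders, not from the original notes):

from ultralytics import YOLO
import cv2

model = YOLO("yolov8n.pt")                 # load a pretrained model (placeholder name)
results = model.predict("some_image.jpg")  # run inference; returns a list of Results objects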
res_plotted = results[0].plot()  # plot results (BBs, masks, classification logits, etc.) on the image.
cv2.imshow("result", res_plotted)  # cv2.imshow() needs a window name as its first argument.
cv2.waitKey(0)
Alternate forms of input, for providing the input image from memory (such as a numpy
array or tensor), along with the expected shape format & input value range, are described
on YOLO's website under the "Predict" section.
YOLO also allows exporting its model in other formats like ONNX, etc.
Link: https://docs.ultralytics.com/modes/train/
Ex:
model = YOLO('yolov8n.yaml') # build a new model from YAML (for training only). This yaml is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/v8/yolov8.yaml.
OR
model = YOLO(<pretrained model name/path, say 'yolov8n.pt'>) # transfer learning.
model.train(data='coco128.yaml', epochs=100, imgsz=640) # train the model.
# For info on the other training parameters that can be provided, see https://github.com/ultralytics/ultralytics/blob/main/ultralytics/yolo/cfg/default.yaml OR https://docs.ultralytics.com/modes/train/#arguments. The "device" arg can also be specified here, to run training on a GPU (or multiple GPUs, ex: device='0,1') or the CPU (device='cpu'), along with batch, image size (imgsz), optimizer, learning rate, etc.
In the illustration, the green curve is the best case and the red curve is inferior; the other
curves are plots of actual training data.
There is always a tradeoff between Precision & Recall, because, if we keep the
threshold tight, Precision will be high, but Recall will be low. If we are liberal with
the threshold, Recall will be high, but Precision suffers.
A high area under the curve represents both high recall and high precision,
where high precision relates to a low false positive rate, and high recall relates to
a low false negative rate.
A poor object detector needs to increase the number of detected objects
(increasing False Positives = lower precision) in order to retrieve all ground truth
objects (high recall). That's why the Precision-Recall curve usually starts with
high precision values, decreasing as recall increases.
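As a quick reference, precision and recall from TP/FP/FN counts (a minimal sketch):

# Precision = TP / (TP + FP); Recall = TP / (TP + FN).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Example: 80 true positives, 20 false positives, 40 missed ground-truth objects.
print(precision_recall(80, 20, 40))  # -> (0.8, ~0.667)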
Average Precision Calculation techniques.
YOLO comes with a long list of architectures. Some are large and some
are small, to train on large or small datasets. Configurations can have different
backbones. There are pre-trained configurations for standard datasets.
(c) Neck - The neck (such as an FPN or BiFPN) connects the backbone and
the head. It is composed of a spatial pyramid pooling (SPP) module and a path
aggregation network (PAN). The neck concatenates the feature maps from
different layers of the backbone network and sends them as inputs to the head.
An SPP block is added after the backbone to increase the receptive field and separate out
the most important features from the backbone.
NOTE:
YOLO v3 uses the concept of "Feature Pyramid Networks" (FPN). FPNs are a
CNN architecture used to detect objects at multiple scales. They construct a
pyramid of feature maps, with each level of the pyramid being used to detect
objects at a different scale. This helps to improve the detection performance on
small objects, as the model is able to see the objects at multiple scales.
Both YOLO v3 and YOLO v4 use anchor boxes with different scales and aspect
ratios to better match the size and shape of the detected objects.
YOLO v5 uses a deeper, more complex architecture (a CSPDarknet-style backbone with
an SPPF block and a PANet-style neck) for higher accuracy and better generalization to a
wider range of object categories.
YOLO v5 uses a new method for generating the anchor boxes, called "dynamic
anchor boxes". It involves using a clustering algorithm to group the ground truth
bounding boxes into clusters and then using the centroids of the clusters as the
anchor boxes. This allows the anchor boxes to be more closely aligned with the
detected objects' size and shape.
YOLO v7 uses nine anchor boxes, which allows it to detect a wider range
of object shapes and sizes compared to previous versions, thus helping to
reduce the number of false positives.
A key improvement in YOLO v7 is the use of a new loss function called “focal
loss”. Previous versions of YOLO used a standard cross-entropy loss function,
which is known to be less effective at detecting small objects. Focal loss battles
this issue by down-weighting the loss for well-classified examples and focusing
on the hard examples - the objects that are hard to detect.
In simple words, Focal Loss (FL) is an improved version of Cross-
Entropy Loss (CE) that tries to handle the class imbalance problem by assigning
more weights to hard or easily misclassified examples.
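A minimal binary focal loss sketch in PyTorch (the gamma and alpha values are the commonly used defaults, not tied to any specific YOLO release):

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Standard BCE per element, then down-weight easy (well-classified) examples by (1 - p_t)^gamma.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example usage with random logits/targets:
logits = torch.randn(8, 1)
targets = torch.randint(0, 2, (8, 1)).float()
print(binary_focal_loss(logits, targets))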
YOLOv7 isn't just an object detection architecture - it provides
new model heads; that can output keypoints (skeletons) and perform instance
segmentation besides only bounding box regression, which wasn't standard with
previous YOLO models. This isn't surprising, since many object detection
architectures were repurposed for instance segmentation and keypoint detection
tasks earlier as well, due to the shared general architecture, with different outputs
depending on the task. Even though it isn't surprising - supporting instance
segmentation and keypoint detection will likely become the new standard for
YOLO-based models, which began outperforming practically all other two-stage detectors a
couple of years ago (link: https://stackabuse.com/pose-estimation-and-keypoint-detection-with-yolov7-in-python/).
For keypoint/pose estimation in YOLOv8, see
https://docs.ultralytics.com/tasks/pose/.
Alternatively, we can use YOLO for obtaining the sub-image of detected
object(BB), then use our own custom model for pose/keypoint estimation on this
smaller sub-image.
- Concat - concatenates feature maps from different layers along the channel dimension.
- CBS(Convolution, BatchNorm, SiLU)
- C3(convolution block with 3 convolutions) - CSP(Cross Stage Partial layer) with
Bottlenecks i.e. C3 is composed of three convolution layers and a module
cascaded by various bottlenecks.
- C2f(convolution block with 2 convolutions)
- Bottleneck(two 3x3 convs with residual connections) - A bottleneck layer is a
layer that contains few nodes compared to the previous layers. It can be used to
obtain a representation of the input with reduced dimensionality.
- SPP(Spatial Pyramid Pooling) - a type of pooling layer used to reduce the
spatial resolution of the feature maps. SPP is used to improve the detection
performance on small objects, as it allows the model to see the objects at
multiple scales. Also to increase the receptive field and separate out the most
important features from the backbone. Also used to remove the fixed size
constraint of the network.
It enables the network to accept inputs of different sizes and generate fixed-
length feature representations by dividing the input feature map into regions of
different sizes and pooling features from each region separately.
The lr0 parameter is the initial learning rate, and lrf is the multiplicative factor for
the final learning rate at the last epoch of training i.e. final lr = [lr0 * lrf] - so if
lr0(starting lr) = 0.1, and lrf=0.01; then final lr = (0.1 * 0.01) = 0.001.
If lrf = 1, learning rate remains constant throughout.
By default, in YOLOv8, both lr0 and lrf have the same value (0.01), and
this value is used during training.
Regarding the cos_lr parameter, if it is set to True, then the learning rate
schedule will follow a cosine annealing pattern rather than a linear schedule. This
can lead to a smoother learning rate schedule and potentially better results. Both
lr0 and lrf are still used in the cosine annealing learning rate schedule.
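For example (a sketch using the argument names above; `model` is a YOLO object as in the earlier training example):

model.train(data='data.yaml', epochs=100, lr0=0.01, lrf=0.01, cos_lr=True)
# final LR ≈ lr0 * lrf = 0.0001, approached along a cosine curve because cos_lr=True.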
3) Code for YOLOv8 on Bus-Truck dataset:
Preparing data:
import pandas as pd
from sklearn.model_selection import train_test_split
import shutil # for file copying operations.
import os
# This file has helper functions for preparing & copying Bus-Truck data in a format &
path structure, as required by YOLO for training purposes.
"""
custom training:
1) Prepare a yaml file (say data.yaml) that contains
(a) the paths to training & validation data/images (containing the input images &
label txt files(1 txt file per image - YOLO expects annotations for each image in form
of a .txt file where each line of the text file describes a bounding box)
respectively).
(b) number of output classes/labels.
(c) names of the labels.
"""
# prepare label .txt file data from csv; for each image in in Bus-Truck dataset. The
labels are in accordance with YOLO's requirement i.e. as many (<class number> <BB X-
center> <BB Y-center> <BB width> <BB height>) as BBs in a given image, in the label
file for that image.
def prepare_BusTruck_labels():
#images_path = "../dataset/images"
labels = ["Bus", "Truck"]
labels_path = "../dataset/labels/" # this will contain the output label txt
files for all images in "images_path".
df = pd.read_csv("../dataset/df.csv") # load entire dataset labels file.
unique_images = df["ImageID"].unique().tolist() # 15225 unique images.
print(f"Data preparation ({len(unique_images)} images) start:")
for image in unique_images:
rows_for_image = df[df["ImageID"] == image]
# create a new txt file & write all rows(data) corresponding to "image", in
this file.
with open(f"{labels_path}{image}.txt","x") as f:
for _, row in rows_for_image.iterrows():
class_number = labels.index(row["LabelName"]) # get class number from
values(0 based).
# values are start/end pixels. Convert them the BB center points &
width-height.
x_center = (row["XMin"] + row["XMax"]) / 2
y_center = (row["YMin"] + row["YMax"]) / 2
width = row["XMax"] - row["XMin"]
height = row["YMax"] - row["YMin"]
# string containing all the data.
row_data = f"{class_number} {x_center} {y_center} {width} {height}"
f.write(row_data)
f.write("\n")
print("prepare_BusTruck_labels() completed.")
#prepare_BusTruck_labels()
# generate_train_validate_data() divides & copies the input images & their labels, to
train & val folder, to be used by YOLO as training & validation data respectively.
def generate_train_validate_data():
df = pd.read_csv("../dataset/df.csv") # load entire dataset.
unique_images = df["ImageID"].unique()
# split database into training(train_ids) & validation(val_ids).
train_ids, val_ids = train_test_split(unique_images, test_size=0.1,
random_state=99)
print(f"Training data len: {len(train_ids)}, Validation data len: {len(val_ids)}")
# path to actual input images & their labels, all at one place.
input_image_file_path = "../dataset/images/"
input_label_file_path = "../dataset/labels/"
# path where input data is to be copied, under "train" & "val" folders.
output_file_path = "../dataset/yolo_data/"
#generate_train_validate_data()
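The copy loop itself is not shown above; a rough sketch of what it could look like (it relies on the variables and imports already defined in this file, and assumes .jpg images):

# Hypothetical completion: copy each image and its label file into yolo_data/{train|val}/{images|labels}/.
for split_name, ids in [("train", train_ids), ("val", val_ids)]:
    os.makedirs(f"{output_file_path}{split_name}/images/", exist_ok=True)
    os.makedirs(f"{output_file_path}{split_name}/labels/", exist_ok=True)
    for image_id in ids:
        shutil.copy(f"{input_image_file_path}{image_id}.jpg", f"{output_file_path}{split_name}/images/")
        shutil.copy(f"{input_label_file_path}{image_id}.txt", f"{output_file_path}{split_name}/labels/")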
# number of classes
nc: 2
# class names
names: ['Bus','Truck']
YOLOv8 training/validation/Prediction:
from ultralytics import YOLO
import cv2
# perform validation.
def validate_model():
# link: https://docs.ultralytics.com/modes/val/
model = YOLO("runs/detect/train/weights/best.pt") # no arguments needed, dataset
and settings remembered.
metrics = model.val(data="data.yaml") #model.val() #results =
model.val(data="data.yaml")
#metrics.box.map # map50-95
#metrics.box.map50 # map50
#metrics.box.map75 # map75
print(metrics.box.maps) # a list contains map50-95 of each category.
print("\nValidation complete.\n")
# predict() is used to perform inference using an already trained model.
def predict(img_name):
model = YOLO("runs/detect/train/weights/best.pt")
# pick up from validation folder data.
img_path = "../dataset/yolo_data/val/images/"
# "stream=True" returns generator(memory efficient). Input to predict() can be
image path, url, opencv|PIL image, tensor, directory, video, etc. link:
https://docs.ultralytics.com/modes/predict/#image-formats for more info on input &
results structure.
results = model.predict(img_path + img_name)
res_plotted = results[0].plot()
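The lines that build all_bbs, confidence and output_class_name did not make it into this listing; a plausible reconstruction using the Results/Boxes attributes documented by Ultralytics (the print statement below belongs inside this loop):

    boxes = results[0].boxes           # detections for the (first) input image
    all_bbs = boxes.xyxy.tolist()      # [x1, y1, x2, y2] per detection
    confidence = boxes.conf.tolist()   # confidence score per detection
    class_ids = boxes.cls.tolist()     # class index per detection
    for i in range(len(all_bbs)):
        output_class_name = results[0].names[int(class_ids[i])]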
print(f"\nBB: {all_bbs[i]}\nClass: {output_class_name}\nConfidence:
{confidence[i]}\n")
#train_model()
#resume_training("runs/detect/train/weights/best.pt", 10)
#validate_model()
Both mAP(B) and mAP(P) provide valuable insights into the model's
performance, depending on the specific requirements of the application.
Typically, mAP(B) is more commonly used for standard object detection tasks,
while mAP(P) is used when precise pixel-level segmentation is required, as in
instance segmentation.
If your training was interrupted for any reason you may continue (resume) where
you left off using the --resume argument (in python: model.train(resume=True)).
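In Python this looks roughly like the following (checkpoint path is a placeholder):

model = YOLO("runs/detect/train/weights/last.pt")  # load the last checkpoint of the interrupted run.
model.train(resume=True)                           # continue training from where it stopped.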
YOLO Learning Rate (LR) schedulers follow predefined LR curves for the
fixed number of --epochs defined at training start, and are designed to fall to a
minimum LR on the final epoch for best training results. For this reason you cannot
modify the number of epochs once training has started.
If your training was fully completed, you can start a new training (extended to
more epochs) from any model (checkpoint) using the --weights argument i.e. If
you would like to start training from a (previously) fully trained model, use the --
weights argument, not the --resume argument:
Ex (CLI): python train.py --weights path/to/best.pt # start from a pretrained model.
For Python, the flag pretrained can be set to True, to load weights from a (path to
a) pretrained model.
To customize YOLO (say model, loss function, etc), see
https://docs.ultralytics.com/usage/engine/.
(1) Create a custom trainer class derived from a base trainer class (PoseTrainer
for PoseModel for key points) (see link:
https://docs.ultralytics.com/usage/engine/).
(2) Create a custom model class derived from a suitable Base model class
(PoseModel for key points) (see link:
https://docs.ultralytics.com/reference/nn/tasks/#ultralytics.nn.tasks.PoseModel).
CODE:
Customization:
Below example code illustrates how to customize the YOLO loss function for
Pose (key-point prediction), by customizing YOLO's Trainer, inside the trainer its
Model, and finally, inside the model, its Loss function.
# This file contains definitions for custom trainer, model & loss functions in YOLO.
def __init__(self, model, loss_scaler_param):
super().__init__(model)
self.loss_scaler = loss_scaler_param
def init_criterion(self):
# set your custom loss function here.
super().init_criterion()
return CustomPoseLoss(self, self.loss_scaler) # return the custom loss class; 1st param is the model object (self), 2nd param is the loss scaler.
# Define CUSTOM TRAINER for pose, derived from Ultralytics PoseTrainer.
# Usage example: CustomPoseTrainer(overrides=<dict>, loss_scaler_param = 10)
class CustomPoseTrainer(PoseTrainer):
def __init__(self, cfg=DEFAULT_CFG, overrides=None, _callbacks=None,
loss_scaler_param=1):
super().__init__(cfg, overrides, _callbacks)
self.loss_scaler = loss_scaler_param
self.overrides = overrides # will be used in set_overrides_in_config(),
that is called in get_model(); to override the contents from “DEFAULT_CFG”.
# This function is used for manually overriding default config (cfg) with
overrides parameter. This should have happened automatically in Ultralytics
super().__init__() though.
def set_overrides_in_config(self, cfg):
yaml_override = yaml_load(self.overrides['data'])
cfg['kpt_shape'] = yaml_override['kpt_shape']
cfg['names'] = yaml_override['names']
return cfg
Usage:
Illustrates how to use the above customized classes for training.
# this file trains using custom trainer; with a custom YOLO model (that has custom
loss function set internally, from the file "yolo_custom_trainer.py").
def train(e=30):
args = dict(model='yolov8n-pose.pt', data='yolo-pose.yaml', epochs=e)
trainer = CustomPoseTrainer(overrides=args, loss_scaler_param=10) #
loss_scaler_param = magnitude to scale the loss with.
print("Training Started:\n\n")
trainer.train()
print("Training Complete.")
return
train(60)
#resume_training(40)
Code:
Once you have created your custom dataloader using the create_dataloader
function and defined your custom preprocessing function, you can pass it directly
to your train function.
Instead of passing the .yaml file as the data argument, you can pass the
train_dataloader you created directly as the first argument. Also, you can remove the
data and batch_size arguments because they will be taken care of by the
train_dataloader you created.
Code:
model = YOLO(...)
model.train(train_dataloader, imgsz=imgsz, epochs=20)
The train function will automatically use the dataloader you pass as the first
argument.
In a network, different layers have different receptive fields to the original image.
For example, the initial layers have a smaller receptive field when compared to
the final layers, which have a larger receptive field.
SSD leverages this phenomenon, to predict classes & BBs:
(a) We leverage the pre-trained (say VGG) network and extend it with a few
additional layers until we obtain a 1 x 1 block.
(b) Instead of leveraging only the final layer for bounding box and class
predictions, we will leverage all of the last few layers to make class and
bounding box predictions (difference between SSD & other detectors like YOLO).
(c) In place of anchor boxes, we will come up with default boxes that have a
specific set of scale and aspect ratios. Each of the default boxes should predict
the object and bounding box offset just like how anchor boxes are expected to
predict classes and offsets.
SSD doesn’t use k-means to find the anchors. Instead it uses a mathematical
formula to compute the anchor sizes. Therefore, SSD’s anchors are
independent of the dataset (hence named default/prior boxes).
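For reference, the SSD paper computes the default-box scale for feature map k (out of m feature maps used for prediction) as s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1); a quick sketch (the torchvision model shown later uses its own scale list):

# Default-box scales per feature map, as in the SSD paper (s_min=0.2, s_max=0.9 there).
def ssd_scales(m, s_min=0.2, s_max=0.9):
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print(ssd_scales(6))  # -> [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]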
Another small difference: YOLO’s (initial versions) anchors are just a width and
height, but SSD’s anchors also have an x,y-position. YOLO simply assumes that
the anchor’s position is always in the center of the grid cell (for SSD this is also
the default thing to do).
Thanks to anchors, the detectors don’t have to work very hard to make
pretty good predictions already, because predicting all zeros simply outputs the
anchor box, which will be reasonably close to the true object (on average). This
makes training a lot easier. Without the anchors, each detector would have to
learn from scratch what the different bounding box shapes look like… a much
harder task.
Model Summary:
Output:
SSD(
(backbone): SSDFeatureExtractorVGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=True)
(17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): ReLU(inplace=True)
(19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
)
(extra): ModuleList(
(0): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): ReLU(inplace=True)
(3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): ReLU(inplace=True)
(5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Sequential(
(0): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
(1): Conv2d(512, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6))
(2): ReLU(inplace=True)
(3): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1))
(4): ReLU(inplace=True)
)
)
(1): Sequential(
(0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(3): ReLU(inplace=True)
)
(2): Sequential(
(0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(3): ReLU(inplace=True)
)
(3): Sequential(
(0): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1))
(3): ReLU(inplace=True)
)
(4): Sequential(
(0): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1))
(3): ReLU(inplace=True)
)
)
)
(anchor_generator): DefaultBoxGenerator(aspect_ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]],
clip=True, scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05], steps=[8, 16, 32, 64, 100, 300])
(head): SSDHead(
(classification_head): SSDClassificationHead(
(module_list): ModuleList(
(0): Conv2d(512, 364, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2d(1024, 546, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(512, 546, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2d(256, 546, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2d(256, 364, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Conv2d(256, 364, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
)
(regression_head): SSDRegressionHead(
(module_list): ModuleList(
(0): Conv2d(512, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Conv2d(1024, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(512, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): Conv2d(256, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Conv2d(256, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
)
)
(transform): GeneralizedRCNNTransform(
Normalize(mean=[0.48235, 0.45882, 0.40784], std=[0.00392156862745098,
0.00392156862745098, 0.00392156862745098])
Resize(min_size=(300,), max_size=300, mode='bilinear')
)
)
Inputs:
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one
entry per image (for a single image the list contains one [C, H, W] tensor). Image values
should be in the 0-1 range.
The behaviour of the model changes depending if it is in training or
evaluation mode.
Training:
During training, the model expects both the input (image) tensors (1st arg), as
well as a targets (list of dictionary), containing:
- boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format,
with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H. i.e. {top-left=(x1,y1) , bottom-
right=(x2,y2)}.
- labels (Int64Tensor[N]): the class label for each ground-truth box.
Ex: [{"boxes":<Tensor.shape(N,4)>, "labels":<Tensor.shape(N)>},
{"boxes":<Tensor.shape(N,4)>, "labels":<Tensor.shape(N)>}, ...]
The length of this list is equal to the number of images in input (1st arg).
The model returns a Dict[Tensor] during training, containing the classification and
regression losses.
Inference:
During inference, the model requires only the input tensors, and returns the post-
processed predictions as a List[Dict[Tensor]], one for each input image. The
fields of the Dict are as follows, where N is the number of detections:
- boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format,
with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
- labels (Int64Tensor[N]): the predicted labels for each detection.
- scores (Tensor[N]): the scores for each detection.
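A short inference sketch with torchvision's pretrained SSD (the weights enum is as in torchvision >= 0.13; treat the exact names as assumptions if your version differs):

import torch
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

model = ssd300_vgg16(weights=SSD300_VGG16_Weights.DEFAULT)
model.eval()

image = torch.rand(3, 300, 300)   # values in [0, 1]; a real image tensor would go here
with torch.no_grad():
    predictions = model([image])  # a list with one Dict per input image
print(predictions[0]["boxes"].shape, predictions[0]["labels"], predictions[0]["scores"])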
NOTE: In general, YOLO is usually preferred when accuracy is more important, while SSD
is preferred when speed (a higher frame rate) is more desired, trading off some accuracy.
The authors view object detection as a set prediction problem (with absolute
box prediction w.r.t. the input image rather than an anchor). A set prediction
problem is when you try to guess a group of items based on some information.
Thinking of object detection as a set prediction problem has some
challenges. The biggest one is getting rid of duplicate predictions.
By treating object detection as a set prediction problem, the need for
manually designed components previously required in object detection
tasks to incorporate prior knowledge is eliminated, like spatial anchors or
non-maximal suppression. This approach simplifies the process and
streamlines the task.
DETR predicts all objects at once, and is trained end-to-end with a set loss
function which performs bipartite matching between predicted and ground-truth
objects.
Paper name: End-to-End Object Detection with Transformers (by Meta AI).
Transformer:
Transformers have proven to be a remarkable architecture for sequence-to-
sequence problems. This class of networks uses only linear layers and softmax
to create self-attention. Self-attention helps in identifying the
interdependency among words in the input text. The input sequence typically
does not exceed 2,048 items as this is large enough for text applications.
However, if images are to be used with transformers, they have to
be flattened, which creates a sequence in the order of thousands/millions of
pixels (as a 300 x 300 x 3 image would contain 270K pixels), which is not
feasible. Facebook Research came up with a novel way to bypass this restriction
by giving the feature map (which has a smaller size than the input image) as
input to the transformer.
Multihead Attention.
Each K vector comes paired with a V value. The greater the compatibility
of a given K with Q, the greater influence the concerned V will have on the output
of the attention mechanism.
These Q, K & V are learnt by the model during training.
K & V come from encoder outputs that are fed to the decoder; while Q
comes from the decoder’s masked MHA (MMHA), which is propagated further in
the decoder.
WORKING:
Each head sees all the words in the sequence, but only a part of their embeddings - it sees
the full sentence, but a different aspect of each word i.e. each head will try to
relate all the words with each other using different aspects of the same word.
Note that there will be eight sets of tensors of Kw11, Kw12, and so on because
there are eight multi-heads.
In each part, we first perform matrix multiplication between the key and query
matrices. This gives a result that indicates what in the input (key) is relevant, with
respect to (or from the point of view of) the query.
This way, we end up with a 3 x 3 matrix. Pass it through softmax activation.
Now, we have a matrix showing how important each word is, in relation to every
other word, as high probabilities. Elements with low probabilities indicate words
that are not important/relevant (with respect to the query) - also called “masking
out”.
Dot products are used to find the cosine similarity between two
vectors. The dot-product between tensors simply expands this to find many
different cosine similarities between the different vectors inside the tensors.
Finally, we perform matrix multiplication of the preceding tensor output with the
value tensor to get the output of our self-attention operation.
So, the output that we get here (attention) contains the information of:
(a) original word embeddings for all words in the input sentence.
(b) positional encoding.
(c) interaction of each word with all other words in the input.
We then combine the eight outputs of this step, go back using concat layer
(step3 in the following diagram), and end up with a single tensor of size 512 x 3.
Because of the splitting of the Q, K, and V matrices, the layer is also called multi-
head self-attention.
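A minimal sketch of the scaled dot-product attention step described above (a single head, in PyTorch):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). The scores say how relevant each key is to each query.
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1 ("importance" of each word)
    return weights @ V                             # weighted sum of the values

# Example with a 3-word sentence and 64-dim projections (the 3 x 3 score matrix from above):
Q, K, V = torch.randn(3, 64), torch.randn(3, 64), torch.randn(3, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([3, 64])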
For our example in computer vision, when searching for an object such as a
horse, the query should contain information to search for an object that is large in
dimension and is brown, black, or white in general. The softmax output of scaled
dot-product attention will reflect those parts of the key matrix that contain this
colour (brown, black, white, and so on) in the image. Thus, the values output
from the self-attention layer will have those parts of the image that are roughly of
the desired colour and are present in the values matrix.
We use the self-attention block several times in the network.
Working of DETR:
DETR is trained with a set-based global loss that performs bipartite matching
between predicted and ground truth objects, and then optimizes object-specific
(bounding box) losses.
DETR demonstrates significantly better performance on large objects, a
result likely enabled by the non-local computations of the transformer. It obtains,
however, lower performances on small objects.
There are few key differences between a normal transformer network and DETR:
1) Our input is an image, not a sequence. So, DETR passes the image
through a ResNet backbone (followed by a 1 x 1 convolution) to get a feature map with 256
channels, which can then be flattened and treated as a sequence (of tokens).
2) The inputs to the decoder are object-query embeddings, which are
automatically learned during training. These act as the query matrices for all the
decoder layers.
3) This enriched feature map of the image is given to a transformer encoder-
decoder, which outputs the set of box predictions, through Feed Forward
Network (FFN) i.e. prediction heads. Each of these boxes consists of a tuple. The
tuple will be a class and a bounding box.
Note: This also includes the class NULL or Nothing class as
well (background class).
4) Unlike the original transformer that outputs the prediction sequence in time
steps i.e. one element at a time (during inference), our transformer outputs all the
N predictions (labels & their bounding boxes) in parallel, at the same time, at
each decoder layer.
Now, this is a real problem as in the annotation there is no object class annotated
as nothing. Comparing and dealing with similar objects next to each other is
another major issue and in this paper, it is tackled by using Bipartite matching
loss (bipartite matching for assigning predictions to ground truth uniquely, then
using Hungarian loss for calculating loss - includes the matching loss &
bounding box loss).
The loss is computed by comparing each predicted class and bounding box (including the
"none" class) - say there are N predictions - with the annotations, which are padded with
"nothing" entries so that the total number of ground-truth boxes is also N. The assignment
of predictions to ground truth is a one-to-one assignment, chosen such that the total loss is
minimized. There is a very famous algorithm, called the Hungarian method, to compute this
minimum-cost matching.
i.e. With the model always outputting a fixed number of object labels, many of
the object labels need to be classified as “No Object”. Additionally, these objects
could be in any order because any of the object queries could detect a specific
object. Therefore, when training the model, the labeled data is padded with “No
Object” labels to match with the size of the model’s output. A bipartite matching
algorithm is then used to match each object with their respective true value.
The matching algorithm completes the matching by minimizing the loss
due to mismatched or poorly localized objects. If there is a mismatch in objects,
the algorithm is forced to match predictions that are not the same as the label,
penalizing the model loss greatly. Extra object predictions will be forced to match
with “No Object” labels. The model, therefore, directly learns to predict the
correct number of objects, so there is no need to use non-max suppression.
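A tiny illustration of the one-to-one (bipartite) matching step, using SciPy's linear_sum_assignment (a Hungarian-style algorithm) on a made-up cost matrix:

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = matching cost between prediction i and ground-truth slot j (including "No Object" slots).
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.6, 0.8, 0.3]])
pred_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one assignment minimizing the total cost
print(list(zip(pred_idx, gt_idx)))              # pairs of (prediction index, ground-truth index)
print(cost[pred_idx, gt_idx].sum())             # total matching cost (here ~0.6, the diagonal)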
DETR Architecture.
From Paper "End-to-End Object Detection with Transformers":
Transformer encoder:
The encoder expects a sequence as input, hence we collapse the
spatial dimensions of z0 into one dimension, resulting in a d×HW feature map.
Each encoder layer has a standard architecture and consists of a multi-head
self-attention module and a feed forward network (FFN). Since the transformer
architecture is permutation-invariant, we supplement it with fixed positional
encodings that are added to the input of each attention layer.
Transformer decoder:
The decoder follows the standard architecture of the transformer,
transforming N (N=100 in paper "End-to-End Object Detection with
Transformers") embeddings of size d using multi-headed self- and encoder-
decoder attention mechanisms. The difference with the original transformer is
that our model decodes the N objects in parallel at each decoder layer.
Since the decoder is also permutation-invariant, the N input embeddings
must be different to produce different results. These input embeddings are
LEARNT(*) positional encodings that we refer to as object queries, and similarly
to the encoder, we add them to the input of each attention layer. The N object
queries are transformed into an output embedding by the decoder.
They are then independently decoded into box coordinates and class
labels by a feed forward network (FFN), resulting in N final predictions. Using self
and encoder-decoder attention over these embeddings, the model globally
reasons about all objects together using pairwise relations between them, while
being able to use the whole image as context.
FFN: The final prediction is computed by a 3-layer perceptron with
ReLU activation function and hidden dimension d, and a linear projection layer.
The FFN predicts the normalized center coordinates, height and width of the box
w.r.t. the input image, and the linear layer predicts the class label using a
softmax function.
Since we predict a fixed-size set of N bounding boxes, where N is usually much
larger than the actual number of objects of interest in an image, an additional special
class label ∅ is used to represent that no object is detected within a slot. This class plays
a similar role to the “background” class in the standard object detection approaches.
The inputs to the decoder are object-query embeddings (used for conditioning
information - each of these object queries can be thought of as a single question
on whether an object is in a certain region. This also means that each object
query represents how many objects that the model can detect), which are
automatically learned during training. These act as the query matrices for all the
decoder layers. Similarly, for every layer, the key and value matrices are going to
be the final output matrix from the encoder block, replicated twice.
In the decoder, you can find attention layers where the query, key, and
values aren’t taken from the same source, but rather only the key and value
tensors are taken from the encoder whereas the query tensor is taken from the
previous decoder layer. The attention layer otherwise works exactly the same
way, only with that slight modification in inputs. This integrates the information
from the encoder into the decoder, i.e. it asks the question “how do the words
(classes) in the input word sequence relate to the words (classes) that have been
outputted so far”.
For the decoder in DETR, it does not make sense to start off its input with a start
token. In fact, object detection does not need to detect objects in sequence, so
there is no need to continuously concatenate the output to the next input in the
decoder.
Instead, a fixed number of trainable inputs (in DETR, 100) are used for the
decoder called object queries.
Object queries are learnt positional encodings. The N object queries are
transformed into an output embedding by the decoder. They are then
independently decoded into box coordinates and class labels by a feed forward
network (FFN), resulting in N final predictions.
Using self- and encoder-decoder attention over these embeddings, the
model globally reasons about all objects together using pairwise relations
between them, while being able to use the whole image as a context.
Training settings for DETR differ from standard object detectors in multiple ways.
The new model requires an extra-long training schedule and benefits from
auxiliary decoding losses in the transformer.
In the preceding code, we are specifying the following:
The layers of interest in sequential order (self.backbone)
The convolution operation (self.conv)
The transformer block (self.transformer)
The final connection to obtain the number of classes (self.linear_class)
The bounding box head (self.linear_bbox)
Define the positional embeddings for the encoder and decoder layers:
self.query_pos is the positional embedding input for the decoder layer, whereas
self.row_embed and self.col_embed form the two-dimensional positional
embeddings for the encoder layer.
return { 'pred_logits': self.linear_class(h), 'pred_boxes': self.linear_bbox(h).sigmoid() }
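The rest of the code being described here is not in these notes; a sketch along the lines of the minimal DETR demo published by the paper's authors (hidden dimension 256, 100 object queries; the details are approximate, not the exact code the notes refer to):

import torch
import torch.nn as nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8, num_enc=6, num_dec=6):
        super().__init__()
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])  # drop avgpool & fc
        self.conv = nn.Conv2d(2048, hidden_dim, 1)                        # 2048 -> 256 channels
        self.transformer = nn.Transformer(hidden_dim, nheads, num_enc, num_dec)
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)        # +1 for "no object"
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))        # object queries
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))    # 2D positional encodings
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, x):
        h = self.conv(self.backbone(x))   # (1, 256, H, W) feature map
        H, W = h.shape[-2:]
        pos = torch.cat([self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
                         self.row_embed[:H].unsqueeze(1).repeat(1, W, 1)], dim=-1)
        src = pos.flatten(0, 1).unsqueeze(1) + h.flatten(2).permute(2, 0, 1)  # (HW, 1, 256) sequence
        h = self.transformer(src, self.query_pos.unsqueeze(1))                # (100, 1, 256) decoded queries
        return {'pred_logits': self.linear_class(h), 'pred_boxes': self.linear_bbox(h).sigmoid()}

# Example: one 3 x 320 x 320 image -> 100 parallel (class, box) predictions.
out = MiniDETR(num_classes=91)(torch.rand(1, 3, 320, 320))
print(out['pred_logits'].shape, out['pred_boxes'].shape)  # (100, 1, 92) and (100, 1, 4)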
ULTRALYTICS RT-DETR:
Code:
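The code itself did not make it into these notes; a minimal usage sketch based on the Ultralytics RT-DETR docs (the checkpoint name is the documented one, treated here as an assumption):

from ultralytics import RTDETR

model = RTDETR("rtdetr-l.pt")              # pretrained RT-DETR (large) checkpoint
results = model.predict("some_image.jpg")  # same predict/val/train API as the YOLO class
print(results[0].boxes.xyxy)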
KEYPOINT DETECTION:
This project uses a ViT as the backbone, with a custom head for detecting 36
keypoints (36*2 (for x,y) = 72 outputs), for left-ventricle wall detection in
echocardiogram frames.
Model:
# https://github.com/gpastal24/ViTPose-Pytorch
import torch
import torchvision
import torch.nn as nn
from torchvision.models.vision_transformer import vit_b_16
class ViTPose(nn.Module):
def __init__(self, num_kp: int, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.model_name = "ViTPose.pth"
# use vit_b_16 as backbone.
# use weights=ViT_B_16_Weights.DEFAULT to get the most up-to-date ImageNet weights.
self.backbone = vit_b_16(weights=torchvision.models.ViT_B_16_Weights.DEFAULT)
# freeze weights.
for params in self.backbone.parameters():
params.requires_grad = False
# use "num_kp" as number of keypoints.
# (1) try changing the default head itself.
# self.backbone.heads = nn.Linear(in_features=768, out_features=num_kp,
bias=True)
# (2) try adding a new keypoint head.
self.backbone.heads = nn.Sequential(nn.Linear(in_features=768,
out_features=400, bias=True), nn.BatchNorm1d(num_features=400), nn.Dropout1d(0.4),
nn.GELU(), nn.Linear(in_features=400, out_features=num_kp, bias=True), nn.GELU())
# (3) 3 layer head.
# self.backbone.heads = nn.Sequential(nn.Linear(in_features=768,
out_features=2048, bias=True), nn.BatchNorm1d(num_features=2048), nn.GELU(),
nn.Linear(in_features=2048, out_features=1024, bias=True),
nn.BatchNorm1d(num_features=1024), nn.GELU(), nn.Linear(in_features=1024,
out_features=num_kp, bias=True), nn.ReLU())
# self.backbone.heads = nn.Sequential(nn.Linear(in_features=768, out_features=400,
bias=True), nn.BatchNorm1d(num_features=400), nn.Dropout1d(0.4), nn.GELU(),
nn.Linear(in_features=400, out_features=400, bias=True),
nn.BatchNorm1d(num_features=400), nn.Dropout1d(0.6), nn.GELU(),
nn.Linear(in_features=400, out_features=num_kp, bias=True), nn.ReLU())
import torch
import torchvision
import glob
import cv2
from torch.utils.data import Dataset
from torchvision import transforms
# this dataset is for synthetic camus dataset files, that have been stored in YOLO
format (.jpgs & corresponding annotation .txt files).
class CamusDataset(Dataset):
# the preprocessing as explained at:
https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html#torc
hvision.models.vit_b_16
preprocess = transforms.Compose([transforms.Resize(240,
interpolation=transforms.InterpolationMode.BILINEAR), transforms.CenterCrop(224),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
def __len__(self):
return len(self.files)
Train-Validate:
import torch
import torch.nn as nn
import cv2
# performs training
def train(epochs, dTrainLoader, dValidateLoader, model, lossFn, optimFn, scheduler):
iters = len(dTrainLoader)
for epoch in range(epochs):
model.train()
train_epoch_loss = 0
loss_scaler = 1 # to scale the loss magnitude, for heavy penalty.
for i, (input_images, kp_gt) in enumerate(dTrainLoader):
optimFn.zero_grad()
prediction_kp = model(input_images) # perform inference.
# compute loss.
kp_loss = lossFn(prediction_kp, kp_gt) * loss_scaler
kp_loss.backward()
optimFn.step()
train_epoch_loss += kp_loss.item() # add to overall loss.
# use scheduler step.
if scheduler is not None:
scheduler.step()
#scheduler.step(epoch + i/iters)
# perform validation after every epoch.
total_val_loss = validate(dValidateLoader, model, lossFn)
# print average training loss.
print(f"Epoch {epoch+1}: Training Loss: {train_epoch_loss:0.3f} , Validation Loss:
{total_val_loss:0.3f}")
return
#print(f"Validation loss: {total_val_loss}.")
return total_val_loss
# visualizes GT in green.
def plot_gt(filename, image):
data = None
filename = filename.replace(".jpg", ".txt")
with open(filename) as f:
data = f.read()
data = data.split(" ")
data = data[5:] # get only keypoint data.
data = list(map(float, data))
# process GT wrt image dimensions.
gt_data = torch.tensor(data, dtype=torch.float).to(device)
gt_data = gt_data.view(-1, 2)
img_scaler = torch.tensor([image.shape[1], image.shape[0]], dtype=torch.int)
gt_data *= img_scaler
gt_data = gt_data.to(dtype=torch.int)
for (x,y) in gt_data:
x = x.item()
y = y.item()
image = cv2.circle(image, (x,y), 2, (0,255,0), 2)
return image
Main:
import torch
import torch.nn as nn
import torch.optim as optim
from train_validate import train, predict
from ViTPose import ViTPose
from CamusDataset import CamusDataset
from torch.utils.data import DataLoader
import os, glob
# optimizer with L2 regularization (weight_decay).
optimFn = optim.Adam(model.parameters(), lr=0.0005, weight_decay=1e-4)
#no_of_batches_per_epoch = len(train_dataset) / batch_s
scheduler = torch.optim.lr_scheduler.LinearLR(optimFn, start_factor=1,
end_factor=0.001, total_iters=epochs)
#scheduler = None #torch.optim.lr_scheduler.CyclicLR(optimFn, 0.00005, 0.0005,
step_size_up=no_of_batches_per_epoch/4, step_size_down=no_of_batches_per_epoch*4,
mode='exp_range', gamma=0.8, cycle_momentum=False)
#scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimFn, T_0=10,
T_mult=1, eta_min=0.00005, last_epoch=-1, verbose=False)
try:
train(epochs, dTrainLoader, dValidateLoader, model, lossFn, optimFn,
scheduler)
except KeyboardInterrupt: # if "Ctrl+C" interrupt occurs, save model.
torch.save(model.state_dict(), save_model_path + model.model_name)
print(f"Model saved to {model.model_name}")
else:
# if no Ctrl+C was pressed, declare training complete.
print("------------Training Complete-----------------")
torch.save(model.state_dict(), save_model_path + model.model_name)
print(f"Model saved to {model.model_name}")
print("\n")
return
#run_training("/Users/abhijitpoojary/Desktop/pixelbin_projects/datasets/synthetic-
camus-val/", 60)
(C) SEGMENTATION:
Architectures such as U-Net (for semantic) & Mask R-CNN (for instance) can be
used for image segmentation.
Usually in object detection, while predicting the class of an object and the
bounding box corresponding to the object, we pass the image through a network,
flatten the output at a certain layer, and connect additional dense layers before
making predictions for the class and bounding box offsets.
However, in the case of image segmentation, where the output shape is
the same as that of the input image's shape, flattening the convolutions' outputs
and then reconstructing the image might result in a loss of information.
Furthermore, the contours and shapes present in the original image will not
vary in the output image in the case of image segmentation.
The two aspects that we need to keep in mind while performing
segmentation are as follows:
- The shape and structure of the objects in the original image remain the
same in the segmented output.
- Leveraging a fully convolutional architecture (and not a structure where we
flatten a certain layer) can help here since we are using one image as input and
another as output.
(a) Binary masks (one mask per instance in the image, along with the class
labels). Each mask is the segmentation of one instance in the image. Used in
Mask R-CNN.
(b) Polygon coordinates. Each row of the array contains the (x,y) coordinates
of a polygon along the boundary of one instance in the image. Used in YOLOv8
segmentation.
SEMANTIC SEGMENTATION:
UNet:
UNet, evolved from the traditional convolutional neural network, was first
designed and applied in 2015 to process biomedical images. The reason it is
able to localize and distinguish borders is by doing classification on every pixel,
so the input and output share the same size.
UNet Architecture.
In the left half (Encoder) of the preceding diagram, we can see that the image
passes through convolution layers, and that the image size keeps reducing while
the number of channels keeps increasing (to capture different details or features
in our image).
However, in the right half (Decoder), we can see that we are upscaling the
downscaled image, back to the original height and width but with as many
channels as there are classes.
In addition, while upscaling in the right half, we are also leveraging information
from the corresponding layers in the left half using skip connections so that we
can preserve the structure/objects in the original image.
NOTE: The difference between Residual connections & UNet skip
connections is that in Residual connections, we add the previous inputs, whereas
in U-Net, we concatenate the previous input signals along the channel
dimension.
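A one-line illustration of the difference (shapes are illustrative):

import torch
enc_feat = torch.randn(1, 64, 56, 56)  # encoder feature map (skip connection)
dec_feat = torch.randn(1, 64, 56, 56)  # upsampled decoder feature map
residual = enc_feat + dec_feat                      # residual connection: element-wise add, still 64 channels
unet_skip = torch.cat([enc_feat, dec_feat], dim=1)  # U-Net skip: concatenate along channels -> 128 channels
print(residual.shape, unet_skip.shape)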
This way, the U-Net architecture learns to preserve the structure (and
shapes of objects) of the original image while leveraging the convolution's
features to predict the classes that correspond to each pixel. In general, we have
as many channels in the output as the number of classes we want to predict.
The decoder should have a sufficient number of convolutions to process the feature maps
generated by the encoder and reconstruct the image/mask.
The most commonly used loss function for the task of image segmentation is a
pixel-wise cross entropy loss. This loss examines each pixel individually,
comparing the class predictions (depth-wise pixel vector) to our one-hot encoded
target vector.
U-net uses a loss function for each pixel of the image. This helps in easy
identification of individual cells within the segmentation map. Softmax is applied
to each pixel, followed by a loss function.
This converts the segmentation problem into a classification problem where we
need to classify each pixel to one of the classes.
Because the cross entropy loss evaluates the class predictions for each
pixel vector individually and then averages over all pixels, we're essentially
asserting equal learning to each pixel in the image. This can be a problem if your
various classes have unbalanced representation in the image, as training can be
dominated by the most prevalent class.
Class weights concepts can be used for weighting this loss for each output
channel in order to counteract a class imbalance present in the dataset.
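In PyTorch this is just nn.CrossEntropyLoss applied per pixel; a small sketch with per-class weights (the numbers are illustrative):

import torch
import torch.nn as nn

num_classes = 3
logits = torch.randn(2, num_classes, 64, 64)         # (batch, classes, H, W) raw scores per pixel
target = torch.randint(0, num_classes, (2, 64, 64))  # (batch, H, W) class index per pixel

class_weights = torch.tensor([0.2, 1.0, 2.0])        # up-weight rarer classes (illustrative values)
criterion = nn.CrossEntropyLoss(weight=class_weights)
print(criterion(logits, target))                     # averages the weighted per-pixel losses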
The Dice coefficient is defined as Dice(A, B) = 2|A ∩ B| / (|A| + |B|), where |A ∩ B| represents
the common elements between sets A (the predicted mask/set of pixels) and B (the ground
truth/target mask/set of pixels), and |A| represents the number of elements in set A (and
likewise for set B). For the case of evaluating a Dice coefficient on predicted segmentation
masks, we can approximate |A ∩ B| as the element-wise multiplication between the prediction
and target mask, and then sum the resulting matrix.
Dice is an F1 score. When applied to Boolean data, using the definitions of true positive (TP),
false positive (FP), and false negative (FN), it can be written as Dice = 2TP / (2TP + FP + FN).
It is different from the Jaccard index, which only counts true positives once in both
the numerator and denominator.
In order to formulate a loss function which can be minimized, we can
simply use (1 - Dice). This loss function is known as the Soft Dice loss because
we directly use the predicted probabilities instead of thresholding and converting
them into a binary mask.
A soft Dice loss is calculated for each class separately and then averaged
to yield a final score.
Ex Code:
# Soft Dice loss calculation for arbitrary batch size, number of classes, and number of spatial dimensions. Assumes the `channels_last` format.
import numpy as np

def soft_dice_loss(y_true, y_pred, epsilon=1e-6):
    # y_true: b x X x Y( x Z...) x c - one-hot encoding of the ground truth.
    # y_pred: b x X x Y( x Z...) x c - network output; must sum to 1 over the c channel (e.g. after softmax).
    # epsilon: used for numerical stability, to avoid division-by-zero errors.
    # skip the batch and class axes when calculating the Dice score.
    axes = tuple(range(1, len(y_pred.shape) - 1))
    numerator = 2 * np.sum(y_pred * y_true, axes)
    denominator = np.sum(np.square(y_pred) + np.square(y_true), axes)
    # the tail of this snippet was cut off in the notes; presumably it returns the average (1 - Dice):
    return 1 - np.mean((numerator + epsilon) / (denominator + epsilon))
The Tversky loss is another type of loss used with segmentation, & is based on
the Tversky index for measuring overlap between two segmented images, &
based on generalization of Dice loss. The Tversky index (TIc) between one image
Y and the corresponding ground truth T is given by:
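The formula itself is not reproduced in these notes; the Tversky index is commonly written (in the same Y/T notation, with α and β weighting the two kinds of disagreement) as:

TI_c = |Y ∩ T| / (|Y ∩ T| + α|Y \ T| + β|T \ Y|)

With α = β = 0.5 this reduces to the Dice coefficient, and with α = β = 1 to the Jaccard index.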
Jaccard index:
The Jaccard coefficient measures the similarity between finite sample sets, and is
defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|).
By design, 0 <= J(A, B) <= 1.
NOTE: |A U B| (i.e. |A| + |B| - |A ∩ B|) is not the same as (|A| + |B|).
In confusion matrices employed for binary classification, the Jaccard index can
be framed as: J = TP / (TP + FP + FN).
Key Differences:
1) The Dice coefficient tends to give more weight to the intersection of the
sets, as it uses the average size of the sets in the denominator. This makes it
more sensitive to smaller overlaps.
2) The Jaccard index, on the other hand, gives equal weight to both the
intersection and union of the sets. It is sometimes preferred when you want a
more balanced measure of similarity.
3) In some applications, the Dice coefficient is considered to be more
stringent, as it may penalize small differences more heavily compared to the
Jaccard index.
4) The Dice coefficient is often used in the context of binary data, such as
pixel-wise segmentation in images, where each pixel is either part of the region
of interest or not. The Jaccard index is more versatile and can be applied to a
wider range of data types and applications.
In summary, both the Dice coefficient and the Jaccard index are measures of set
similarity, but they differ in how they combine the sizes of the intersection and
union of sets, making them suitable for different contexts and applications.
The two are related as D = 2J / (1 + J), and equivalently J = D / (2 - D).
Accuracy:
Accuracy is a more general metric used for classification tasks. It measures the
ratio of correctly predicted instances (both positive & negative) to the total
number of instances.
Accuracy is useful when all classes in a classification problem are of equal
importance and have a similar distribution in the dataset.
Accuracy provides an overall measure of how well a model is performing
across all classes. It is a useful metric when classes are balanced, but it can be
misleading when dealing with imbalanced datasets because a model can achieve
high accuracy by simply predicting the majority class.
Accuracy takes into account TN too, whereas Precision, Recall, Dice & Jaccard
only consider TP (i.e. evaluate the quality of positive predictions).
1) Code for Semantic Segmentation with UNet on Road Dataset:
Dataset:
import os
from torchvision import transforms
from torch.utils.data import Dataset
import torch, cv2
import numpy as np
class RoadDataset(Dataset):
def __init__(self, path, isTrain) -> None: # path = "../dataset"
super().__init__()
resizeValue = 224
self.path = path
self.preprocess = transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
self.data = [] # holds the data(input images/segmentation masks) names,
without extensions.
# set folder names to pick data from, depending on training or validation.
self.images_foldername = ""
self.seg_foldername = ""
if isTrain:
self.images_foldername = "/images_prepped_train/"
self.seg_foldername = "/annotations_prepped_train/"
else:
self.images_foldername = "/images_prepped_test/"
self.seg_foldername = "/annotations_prepped_test/"
filepath = path + self.images_foldername
# get list of all images in train or validation folder.
self.data = [os.path.splitext(filename)[0] for filename in
os.listdir(filepath)]
#print(len(self.data))
def __len__(self):
return len(self.data)
Model:
import torch
import torch.nn as nn
from torchvision.models import vgg16_bn, VGG16_BN_Weights
"""
VGG16-BN summary:
VGG(
(features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)                                # block1 (2 Convs).
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU(inplace=True)
    (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU(inplace=True)                               # block2 (2 Convs, 1 maxpool).
    ... (layers 13-33 omitted) ...
    (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (36): ReLU(inplace=True)
    (37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (39): ReLU(inplace=True)
    (40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (42): ReLU(inplace=True)
    (43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)   # bottleneck.
)
(avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
(classifier): Sequential(
(0): Linear(in_features=25088, out_features=4096, bias=True)
(1): ReLU(inplace=True)
(2): Dropout(p=0.5, inplace=False)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(4): ReLU(inplace=True)
(5): Dropout(p=0.5, inplace=False)
(6): Linear(in_features=4096, out_features=1000, bias=True)
)
)
"""
class RoadSegmentationModel(nn.Module):
    def __init__(self, out_channels) -> None:
        super().__init__()
        self.model_name = "trained_models/RoadSegmentationModel.pth"
        # use the 'features' block from the vgg16_bn pretrained model as the convolutions providing feature maps.
        self.encoder = vgg16_bn(weights=VGG16_BN_Weights.DEFAULT).features
        self.down_block1 = nn.Sequential(*self.encoder[:6])      # from 0-5.
        self.down_block2 = nn.Sequential(*self.encoder[6:13])    # from 6-12.
        self.down_block3 = nn.Sequential(*self.encoder[13:20])
        self.down_block4 = nn.Sequential(*self.encoder[20:27])
        self.down_block5 = nn.Sequential(*self.encoder[27:34])
        # freeze weights.
        #for param in self.parameters():
        #    param.requires_grad = False
        self.bottleneck = nn.Sequential(*self.encoder[34:])
        self.conv_bottleneck = self.conv(512, 1024)    # encoder output channels is 512 here, see the VGG16_BN summary.

    # a unit that performs upscaling of the image, via transposed convolution. The output image size
    # depends on the kernel size & stride values, among other (default-valued, not set here) variables.
    # ConvTranspose2d output image dimensions formula:
    #   input width, height: Win, Hin. output width, height: Wout, Hout.
    #   Hout = (Hin-1)*stride[0] - 2*padding[0] + dilation[0]*(kernel_size[0]-1) + output_padding[0] + 1
    #   Wout = (Win-1)*stride[1] - 2*padding[1] + dilation[1]*(kernel_size[1]-1) + output_padding[1] + 1
    # with stride=2, kernel_size=2, padding=0, dilation=1, output_padding=0:
    #   Hout = (Hin-1)*2 - 2*0 + 1*(2-1) + 0 + 1
    #        = (Hin-1)*2 + 1 + 1
    #        = (Hin-1)*2 + 2 = 2*Hin
    # and similarly for Wout.
    def up_conv(self, in_channels, out_channels):
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2),
            nn.ReLU(inplace=True)
        )

    # (the conv()/up_block() helper methods and the start of forward() are not shown in this
    #  excerpt; a hedged sketch of the missing pieces follows after this class.)
    def forward(self, input):
        # ... down blocks 1-5 producing block1..block5, then up blocks 1-2 (elided) ...
        bottleneck = self.bottleneck(block5)
        input = self.conv_bottleneck(bottleneck)
        input = self.up_block(input, block3, self.up_conv8, self.conv8)     # up block 3.
        input = self.up_block(input, block2, self.up_conv9, self.conv9)     # up block 4.
        input = self.up_block(input, block1, self.up_conv10, self.conv10)   # up block 5.
        input = self.conv11(input)
        return input
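The conv() and up_block() helpers (and the first half of forward()) did not survive in the extract above. Below is a minimal, self-contained sketch of the kind of helpers being called, plus a quick check of the ConvTranspose2d shape formula worked out in the comments; the helper names mirror the calls above, but their bodies are assumptions:

# Hypothetical sketch (not the author's exact code) of the pieces elided above.
import torch
import torch.nn as nn

def conv(in_channels, out_channels):
    # double 3x3 conv + BN + ReLU block, keeping spatial size (the usual UNet "conv" unit).
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
    )

def up_block(input, skip, up_conv, conv):
    # upscale (the 2x2 / stride-2 transposed conv doubles H & W), concatenate the encoder
    # skip connection along the channel dimension, then convolve.
    input = up_conv(input)
    input = torch.cat([input, skip], dim=1)
    return conv(input)

# shape check of the formula derived above: a 2x2 transposed conv with stride 2 doubles H & W.
up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
print(up(torch.randn(1, 1024, 7, 7)).shape)    # torch.Size([1, 512, 14, 14])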
Train/Validate:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
    # (only a fragment of the validation batch loop survives here; a fuller sketch follows below.)
        optimFn.zero_grad()
        pred_masks = model(input_images)
        # compute loss and accuracy.
        loss, accuracy = lossFn(pred_masks, label_masks)
        print(f"Validation per Batch => Loss={loss.item():0.3f}, Accuracy={accuracy.item():0.3f}")
    print("\n")
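Only a fragment of the training/validation code survives above. A minimal sketch of what the two loops plausibly look like, reusing the names from the calls elsewhere in these notes (the loop bodies and the module-level device global are assumptions):

# Hypothetical sketch (not the author's exact code) of the train/validate loops used above.
def train(epochs, model, dLoader, lossFn, optimFn):
    model.train()
    for epoch in range(epochs):
        for input_images, label_masks in dLoader:
            input_images, label_masks = input_images.to(device), label_masks.to(device)
            optimFn.zero_grad()
            pred_masks = model(input_images)
            loss, accuracy = lossFn(pred_masks, label_masks)
            loss.backward()
            optimFn.step()
        print(f"Epoch {epoch} => Loss={loss.item():0.3f}, Accuracy={accuracy.item():0.3f}")

def validate(model, dLoader, lossFn, optimFn):
    model.eval()
    with torch.no_grad():    # no gradients needed for validation.
        for input_images, label_masks in dLoader:
            input_images, label_masks = input_images.to(device), label_masks.to(device)
            pred_masks = model(input_images)
            loss, accuracy = lossFn(pred_masks, label_masks)
            print(f"Validation per Batch => Loss={loss.item():0.3f}, Accuracy={accuracy.item():0.3f}")
    print("\n")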
main:
import torch
import torch.nn as nn
from RoadDataset import RoadDataset
from torch.utils.data import DataLoader
from RoadSegmentationModel import RoadSegmentationModel
from train_validate import train, validate
import torch.optim as optim
import numpy as np
import cv2
from torchvision import transforms
# (the start of UNetLoss(pred_prob_masks, label_masks) - computing ce_loss from the predictions -
#  is not shown in this excerpt; a sketch of the full function follows below.)
    # torch.max() returns the max values [0] & their indices [1] - the indices are the channel numbers, i.e. 0-11.
    _, preds_indices_masks = torch.max(pred_prob_masks, 1)
    # compute accuracy, based on the similarity between prediction & label pixel values.
    accuracy = (preds_indices_masks == label_masks).float().mean()
    return ce_loss, accuracy
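Only the tail of UNetLoss survives above. A minimal sketch of the full function, assuming the standard per-pixel cross-entropy for 12-class segmentation:

# Hypothetical sketch (not the author's exact code) of the UNetLoss function used above.
ce = nn.CrossEntropyLoss()

def UNetLoss(pred_prob_masks, label_masks):
    # pred_prob_masks: [N, 12, H, W] raw logits; label_masks: [N, H, W] class indices 0-11.
    ce_loss = ce(pred_prob_masks, label_masks)
    # torch.max() returns the max values [0] & their indices [1]; the indices are the channel numbers 0-11.
    _, preds_indices_masks = torch.max(pred_prob_masks, 1)
    # pixel accuracy: fraction of pixels whose predicted class matches the label.
    accuracy = (preds_indices_masks == label_masks).float().mean()
    return ce_loss, accuracy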
def run_validation():
    dataset = RoadDataset("../dataset", False)
    dLoader = DataLoader(dataset, batch_size=10, shuffle=False, drop_last=True)
    model = RoadSegmentationModel(classes).to(device)
    optimFn = optim.Adam(model.parameters(), lr=0.0005)
    lossFn = UNetLoss
    result = model.load_state_dict(torch.load(model.model_name))
    print("\n")
    print("------------Validation Started-----------------")
    validate(model, dLoader, lossFn, optimFn)
    print("------------Validation Complete-----------------")
    print("\n")

# to resume training.
def resume_training(epochs = 20):
    dataset = RoadDataset("../dataset", True)
    # drop_last=True drops the last non-full batch of the dataset.
    dLoader = DataLoader(dataset, batch_size=10, shuffle=True, drop_last=True)
    model = RoadSegmentationModel(classes).to(device)
    # load the previously trained model.
    result = model.load_state_dict(torch.load(model.model_name))
    optimFn = optim.Adam(model.parameters(), lr=0.0005)
    lossFn = UNetLoss
    try:
        print("\n")
        print("------------Training Resumed-----------------")
        train(epochs, model, dLoader, lossFn, optimFn)
    except KeyboardInterrupt:    # if a "Ctrl+C" interrupt occurs, save progress.
        torch.save(model.state_dict(), model.model_name)
        print(f"Model saved to {model.model_name}")
    else:
        print("------------Training Complete-----------------")
        torch.save(model.state_dict(), model.model_name)
        print(f"Model saved to {model.model_name}")
    print("\n")
# preprocess input image before prediction.
def preprocess_image_for_prediction(image):
    # use COLOR_BGR2RGB here, for use as input in the model.
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    preprocess = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
    image = preprocess(image)
    return image

# performs prediction on a single image. Displays both the label & predicted mask for performance comparison.
def perform_prediction(path, img_path, label_path, input_filename):
    # Create a new folder with 1 test image & its label mask, then use RoadDataset to point to this folder.
    # read the input image.
    input_image = cv2.imread(path + img_path + input_filename)
    input_image = cv2.resize(input_image, (224, 224)).astype(np.float32)
    input_image = (input_image/255)    # if we don't do this, the entire image is displayed as white
    # in cv2.imshow() (float images are expected in [0, 1]). No need for COLOR_BGR2RGB here, since
    # cv2.imread() gives BGR and cv2.imshow() expects BGR.
    # create & load the trained model.
    model = RoadSegmentationModel(classes).to(device)
    result = model.load_state_dict(torch.load(model.model_name))
    model.eval()    # switch BatchNorm/Dropout to inference mode.
    # preprocess the input image as a separate tensor, to be fed into the model.
    preprocessed_input_image = preprocess_image_for_prediction(input_image)
    # run inference. NOTE: if device is a GPU, also move the input to device and bring
    # predicted_mask back with .cpu() before calling .numpy().
    predicted_mask = model(preprocessed_input_image.unsqueeze(0))    # add batch dimension to input.
    _, predicted_mask = torch.max(predicted_mask, dim=1)    # merge the 12-channel image into a 1-channel image.
    cv2.imshow("Input Image", input_image)
    cv2.moveWindow("Input Image", 10, 20)    # set window position.
    # load the input image's label mask & preprocess it, if it exists.
    label_mask = None
    value_multiplier = int(255/11)    # apply this to the segment mask, for visualization purposes.
    # the LABEL image is used only for display purposes.
    label_mask = cv2.imread(path + label_path + input_filename, cv2.IMREAD_GRAYSCALE)
    if label_mask is not None:
        label_mask = cv2.resize(label_mask, (224, 224))
        label_mask = torch.from_numpy(label_mask).long()
        label_mask *= value_multiplier    # modify mask values from range 0-11 to 0-255. (255/11=23)
        cv2.imshow("Label Mask", label_mask.numpy().astype(np.uint8).copy())
        cv2.moveWindow("Label Mask", 250, 20)
    # modify prediction mask values from range 0-11 to 0-255.
    predicted_mask *= value_multiplier
    # convert predicted_mask from shape=[1, 224, 224] to shape=[224, 224]
    predicted_mask = torch.squeeze(predicted_mask)
    cv2.imshow("Predicted Mask", predicted_mask.numpy().astype(np.uint8).copy())
    cv2.moveWindow("Predicted Mask", 500, 20)
    cv2.waitKey()
    print("Inference complete.\n")

#resume_training(10)
#perform_prediction("../dataset/", "images_prepped_train/", "annotations_prepped_train/", "0006R0_f02100.png")    # from the training set.
INSTANCE SEGMENTATION:
Mask R-CNN:
The Mask R-CNN architecture helps in identifying/highlighting the instances of
objects of a given class within an image; & is an extension of the Faster R-CNN,
with the following modifications:
- The RoI Pooling layer has been replaced with the RoI Align layer.
- A mask head has been included to predict a mask of objects in addition to
the head, which already predicts the classes of objects and bounding box
correction in the final layer.
- A fully convolutional network (FCN) is leveraged for mask prediction.
An FCN is a network (used mainly for semantic segmentation) that does not contain any “Dense” layers (as in traditional CNNs with fully connected (flattened) layers); instead it contains 1x1 convolutions that perform the task of fully connected (Dense) layers. FCNs employ solely locally connected layers, such as convolution, pooling and upsampling.
Having no dense layers means fewer parameters to deal with.
Working:
In the preceding diagram, note that we are fetching the class and bounding box information
from one layer and the mask information from another layer.
Mask-RCNN architecture.
RoI Align:
Note that the region (in dashed lines) is not equally spread across all the cells in
the feature map.
We must perform the following steps to get a reasonable representation of the
region in a 2 x 2 shape:
(1) First, divide the region into an equal 2 x 2 shape:
(2) Define four points that are equally spaced within each of the 2 x 2 cells:
The distance between two consecutive points is 0.75.
(3) Calculate the weighted average value of each point based on its distance
to the nearest known value (bilinear interpolation):
(4) Repeat the preceding interpolation step for all four points in a cell:
(5) Perform average pooling across all four points within a cell:
Ex: (0.21778 + 0.27553 + 0.14896 + 0.21852)/4 = 0.86079/4 ≈ 0.21520
By implementing the preceding steps, we don't lose out on information when
performing RoI Align; that is, when we place all the regions inside the same
shape.
Using RoI Align, we can get a more accurate representation of the region
proposal that is obtained from the Region Proposal Network.
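torchvision ships RoI Align as a ready-made op, so the steps above can be tried directly; a small sketch (the feature map and box values are made up):

# Quick sketch using torchvision's built-in RoI Align op (values are illustrative).
import torch
from torchvision.ops import roi_align

feature_map = torch.rand(1, 256, 50, 50)            # [N, C, H, W] feature map from the backbone.
# boxes are given as [batch_index, x1, y1, x2, y2] in feature-map coordinates.
boxes = torch.tensor([[0, 10.3, 12.7, 30.1, 25.4]])
pooled = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0,
                   sampling_ratio=4, aligned=True)
print(pooled.shape)                                 # torch.Size([1, 256, 7, 7])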
Mask Head:
Typically, in the case of object detection, we would pass the RoI Align through a
flattened layer in order to predict the object's class and bounding box offset.
However, in the case of image segmentation, we predict the pixels within a
bounding box that contains the object. Hence, we now have a third output (apart
from class and bounding box offset), which is the predicted mask within the
region of interest.
Here, we are predicting the mask, which is an image overlaid on top of the original image. Given that we are predicting an image, instead of flattening the RoI Align output, we connect it to one or more additional convolution layers to obtain an image-like structure.
The label masks for instance segmentations are in the form of binary masks i.e.
Height x Width x NumberOfInstances:
Height = height of the mask image (& input data image)
Width = width of the mask image (& input data image)
NumberOfInstances = number of instances in the input image.
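For reference, torchvision's pretrained Mask R-CNN returns exactly this structure (one binary mask per detected instance); a minimal inference sketch:

# Minimal sketch: running torchvision's pretrained Mask R-CNN and reading its per-instance masks.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)                     # a dummy RGB image tensor in [0, 1].
with torch.no_grad():
    output = model([image])[0]                      # dict with 'boxes', 'labels', 'scores', 'masks'.

print(output["boxes"].shape)                        # [N, 4] bounding boxes.
print(output["masks"].shape)                        # [N, 1, H, W] soft masks, one per instance.
keep = output["scores"] > 0.5                       # filter low-confidence detections.
binary_masks = (output["masks"][keep, 0] > 0.5)     # threshold to get binary instance masks.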
mask_to_polygon.py (converts masks from an image into x,y points, as required by YOLO):
import os
import cv2
import numpy as np
import shutil
# indentation: 4 spaces.

# separates individual class masks from the RGB mask image. Also returns the individual class indices.
def separate_underwater_mask(rgb_mask, H, W):
    classes = []; masks = []
    gray_mask = cv2.cvtColor(rgb_mask, cv2.COLOR_RGB2GRAY)    # convert the rgb mask to grayscale.
    # (a loop over the per-class gray-level codes - producing `index` and `gray_code` - is not shown in this excerpt.)
        # get the class mask for the current class, if any.
        class_mask = np.zeros((H, W), dtype=rgb_mask.dtype)    # get a blank mask.
        diff_mask_indices = (gray_mask == gray_code[0][0])     # get indices of pixels matching our class mask code.
        class_mask[diff_mask_indices] = 255                    # set all such matched pixels to the highest value.
        # debugging only.
        #cv2.imshow("class mask", class_mask)
        #cv2.waitKey()
        # append only if a mask was detected for the current class mask code.
        if class_mask.max() != 0:
            classes.append(index)
            masks.append(class_mask)
# writes points info into file 'f'.
def write_points_in_file(polygons: list, class_index, f):
    for polygon in polygons:    # for each new contour in the list of contours, add a new entry
        # of its class index followed by the points info.
        f.write(f"{class_index} ")
        for p in polygon:       # write a single contour's points info.
            #f.write('{} '.format(p))
            f.write(f'{p[0]} {p[1]} ')
        f.write('\n')           # put "\n" at the end of each contour.
    return

# (the enclosing convert_label_mask_to_yolo_format(input_dir, output_dir) definition, the reading of
#  each mask, and the call to separate_underwater_mask() are not shown in this excerpt.)
    for j in os.listdir(input_dir):
        image_path = os.path.join(input_dir, j)
        if os.path.isdir(image_path):
            continue
        # (2) print the points for all class masks in the mask image, into a file.
        with open('{}.txt'.format(os.path.join(output_dir, j)[:-4]), 'w') as f:
            # for each class mask in a given label, generate & store its points data.
            for index, mask in enumerate(masks):
                # each mask image contains only one object (mask).
                polygons = convert_mask_to_polygons(mask, H, W)
                # (1) add the class number here, in the case of multiclass segmentation.
                write_points_in_file(polygons, classes[index], f)    # write points info in file, for a single class mask.
        f.close()    # redundant inside a "with" block, but harmless.
    print("Conversion complete")
# TRAINING DATASET.
#in_dir = "../dataset/underwater/train_val/masks"     # path to input masks (images).
#out_dir = "../dataset/underwater/train_val/labels"   # path where output mask points (text files) should go. The output folder should exist beforehand.
# TEST DATASET.
in_dir = "../dataset/underwater/test/masks"
out_dir = "../dataset/underwater/test/labels"
convert_label_mask_to_yolo_format(in_dir, out_dir)
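convert_mask_to_polygons() is referenced above but its body is not in the extracted text. A minimal sketch using cv2.findContours, normalising the x,y coordinates by the image width & height as YOLO's segmentation label format expects (the normalisation and the noise-area threshold are assumptions about the original implementation):

# Hypothetical sketch (not the author's exact code) of convert_mask_to_polygons().
def convert_mask_to_polygons(mask, H, W):
    # find the outer contours of the (0/255) class mask.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        if cv2.contourArea(contour) < 10:    # skip tiny specks of noise.
            continue
        polygon = []
        for point in contour.reshape(-1, 2):
            # YOLO expects x, y normalised by image width & height.
            polygon.append((point[0] / W, point[1] / H))
        polygons.append(polygon)
    return polygons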
data.yaml:
# Ultralytics seems to store the previous path somewhere, hence the need to provide absolute paths.
train: /Users/abhijitpoojary/Desktop/PyTorchVSCode/Segmentation/YOLO8_Segmentation/dataset/underwater/train_val/images
val: /Users/abhijitpoojary/Desktop/PyTorchVSCode/Segmentation/YOLO8_Segmentation/dataset/underwater/test/images
# number of classes
nc: 8
# class names
names: ["background", "human divers", "plants", "ruins", "robots", "reefs & invertebrates", "fish & vertebrates", "sea floor & rocks"]
Training / Inference:
# YOLO performs instance segmentation. For pretrained YOLO segmentation model weights, see link:
# https://docs.ultralytics.com/models/yolov8/#supported-tasks
# Before training on the underwater dataset, the (pretrained-only) model detects only "person", not robot or reef.
# After training/fine-tuning, the model segments all classes with good accuracy.
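Only the validation/prediction parts of the script survive in the extract; a minimal sketch of the train_model()/resume_training() calls referenced later in the script, using the ultralytics API (the checkpoint name, epochs and image size are assumptions):

# Hypothetical sketch of the train_model()/resume_training() functions called at the bottom of the script.
from ultralytics import YOLO

def train_model(epochs=50):
    model = YOLO("yolov8n-seg.pt")                   # start from a pretrained segmentation checkpoint.
    model.train(data="data.yaml", epochs=epochs, imgsz=640)

def resume_training(weights_path, epochs=10):
    model = YOLO(weights_path)                       # e.g. "runs/segment/train/weights/best.pt"
    model.train(data="data.yaml", epochs=epochs, imgsz=640)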
# perform validation.
def validate_model():
    # link: https://docs.ultralytics.com/modes/val/
    model = YOLO("runs/segment/train/weights/best.pt")    # no arguments needed, dataset and settings are
    # remembered. It seems all the necessary data is stored in an ultralytics database somewhere.
    metrics = model.val(data="data.yaml")    #model.val()    #results = model.val(data="data.yaml")
    #metrics.box.map      # map50-95
    #metrics.box.map50    # map50
    #metrics.box.map75    # map75
    print(metrics.box.maps)    # a list containing the map50-95 of each category.
    print("\nValidation complete.\n")
    # (the beginning of predict() - loading the trained model and running inference to get `result` -
    #  is not shown in this excerpt.)
    all_boxes = result.boxes.boxes    # this is for obtaining the class indices (at array index 5,
    # i.e. the 6th element - all_boxes shape is (N, 6), where N is the number of detections).
    predicted_class_indices = all_boxes[:, 5].to(torch.uint8)
    masks = result.masks    # masks.segments (bounding coordinates of masks), masks.data (raw masks tensor).
    # "masks" contains all the masks (classes) in the input test image.
    img = cv2.imread(img_path + img_name)    # read the original image.
    #img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # no need to do this; the BGR image is displayed correctly by opencv as-is.
    cv2.imshow("original image", result.orig_img)
    cv2.moveWindow("original image", 10, 10)
    if not masks or masks.shape[0] == 0:
        print("\nNo Detections")
        cv2.waitKey(0)
    else:
        superimpose_segments(img, masks, predicted_class_indices, result.names)
    print("\nPrediction completed.\n")
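superimpose_segments() is called above but not shown in the extracted text. A minimal sketch that overlays each predicted mask on the image with a per-class colour (the colour scheme and blending factor are assumptions):

# Hypothetical sketch (not the author's exact code) of superimpose_segments().
import numpy as np
import cv2

def superimpose_segments(img, masks, class_indices, class_names, alpha=0.4):
    H, W = img.shape[:2]
    rng = np.random.default_rng(0)
    colors = rng.integers(0, 255, size=(len(class_names), 3), dtype=np.uint8)    # one colour per class.
    overlay = img.copy()
    for mask, cls in zip(masks.data, class_indices):             # masks.data: [N, h, w] raw mask tensor.
        mask = cv2.resize(mask.cpu().numpy(), (W, H)) > 0.5      # resize to image size & binarise.
        overlay[mask] = colors[int(cls)]                         # paint the instance with its class colour.
    blended = cv2.addWeighted(overlay, alpha, img, 1 - alpha, 0)
    cv2.imshow("segmented image", blended)
    cv2.moveWindow("segmented image", 400, 10)
    cv2.waitKey(0)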
#train_model()
#resume_training("runs/segment/train/weights/best.pt", 10)
#validate_model()
predict("d_r_47_.jpg")
#predict("d_r_58_.jpg")
#predict("d_r_84_.jpg")
#predict("d_r_122_.jpg")
#predict("d_r_129_.jpg")
OBJECT TRACKING:
Object Tracking (& motion estimation) is the process of locating & tracking a moving object (or multiple objects) over time (in a video) using a camera. It has a variety of uses, some of which are: human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, medical imaging and video editing.
The objective of video tracking is to associate target objects in
consecutive frames.
Object tracking is an application where the program takes an initial set of
object detections and develops a unique identification for each of the initial
detections and then tracks the detected objects (maintaining their identification)
as they move around frames in a video. In other words, object tracking is the task
of automatically identifying objects in a video and interpreting them as a set of
trajectories with high accuracy (including when objects are occluded, or
disappear for a few frames & come back again).
Common tracking approaches/algorithms:
- ByteTrack (handles occlusion & hence unwanted ID switching too - can be used in conjunction with other trackers).
- Kalman Filter (for predicting the next position), with the Hungarian/Kuhn–Munkres algorithm (an optimization algorithm for data association, i.e. matching objects between the previous & current frames).
- Optical Flow (https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.html) - see the sketch after this list.
- Track Any Point (TAP) techniques:
(a) TAPIR (from DeepMind - Tracking Any Point with per-frame Initialization and temporal Refinement) - a robust technique for tracking any given point/pixel in a video.
(b) CoTracker (from Meta AI) - tracks multiple points (say, belonging to the same object) together, so that tracking accuracy is better.
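A tiny sketch of the sparse optical-flow approach linked above (Lucas-Kanade, following the OpenCV tutorial), tracking corner points from frame to frame; the video path is a placeholder:

# Minimal sparse optical-flow tracking sketch with OpenCV (Lucas-Kanade).
import cv2

cap = cv2.VideoCapture("video.mp4")                      # placeholder path.
ok, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
# pick good corner points to track in the first frame.
p0 = cv2.goodFeaturesToTrack(old_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # estimate where each point moved between the previous and current frame.
    p1, status, err = cv2.calcOpticalFlowPyrLK(old_gray, gray, p0, None, winSize=(15, 15), maxLevel=2)
    good_new = p1[status == 1]
    for x, y in good_new.reshape(-1, 2):
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:                     # press Esc to quit.
        break
    old_gray, p0 = gray, good_new.reshape(-1, 1, 2)

cap.release()
cv2.destroyAllWindows()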