Object Detection and Localization: Academic Session 2022/2023
Localization
For object detection, we need to classify the objects in an image and also
find their bounding boxes (i.e. what is in the image and where the objects are).
Task: find a pre-defined object in an image
Timeline
Approaches to Object Detection
Classification
Feature extraction:
This is our Backbone Network
Predictions on a Grid:
Use a Backbone Network
A very large labelled dataset (such as ImageNet) can be used to train
the backbone network in order to learn good feature representations.
Predictions on a Grid:
The Backbone Network Output
After pre-training, remove the last few layers of the network. The backbone
network now outputs a collection of stacked feature maps which describe the
original image at low spatial resolution but high feature (channel)
resolution (7x7x512 in this network).
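A minimal PyTorch sketch of this step, assuming torchvision's ImageNet-pretrained
VGG16 as the backbone (an assumption; its convolutional stack yields the 7x7x512
quoted above for a 224x224 input):

    import torch
    import torchvision

    # Keep only the convolutional stack of an ImageNet-pretrained VGG16,
    # discarding the fully connected classification layers at the top.
    backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features

    image = torch.randn(1, 3, 224, 224)   # dummy 224x224 RGB image
    features = backbone(image)
    print(features.shape)                 # torch.Size([1, 512, 7, 7])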
Predictions on a Grid:
Relating Back to the Original Image
Predictions on a Grid:
Object is at the Centre Cell
Objects are roughly located in the coarse (7x7) feature maps at the cell
containing the centre of the bounding box annotation. This grid cell is
"responsible" for detecting that specific object.
Predictions on a Grid:
How to Detect Centre Cell
In order to detect the object, add another convolutional layer and learn
the kernel parameters which combine the context of all 512 feature maps
in order to produce an activation corresponding to the grid cell which
contains the object.
Predictions on a Grid:
Multiple Activations
5+C convolutional filters produce one bounding box descriptor for each grid
cell: 5 values describing the box (x, y, w, h and an object-confidence score)
plus C class scores.
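As a sketch, this head can be realized as a 1x1 convolution over the 512
feature maps (the kernel size and C = 20 classes, e.g. PASCAL VOC, are
assumptions here):

    import torch
    import torch.nn as nn

    C = 20                                       # number of classes (assumed)
    head = nn.Conv2d(512, 5 + C, kernel_size=1)  # 5 box values + C class scores

    features = torch.randn(1, 512, 7, 7)         # backbone output as above
    predictions = head(features)
    print(predictions.shape)                     # torch.Size([1, 25, 7, 7])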
Predictions on a Grid:
Multiple Objects on the Same Cell
Images might have multiple objects which "belong" to the same grid cell.
We can alter the layer to produce B(5+C) filters so that we can predict B
bounding boxes for each grid cell location.
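Continuing the sketch with an assumed B = 2, the output can be regrouped so
that each grid cell carries B separate box descriptors:

    import torch
    import torch.nn as nn

    B, C = 2, 20                                  # boxes per cell and classes (assumptions)
    head = nn.Conv2d(512, B * (5 + C), kernel_size=1)

    features = torch.randn(1, 512, 7, 7)
    predictions = head(features)                  # [1, B*(5+C), 7, 7]
    # Regroup so every cell holds B separate (5+C)-dimensional descriptors.
    predictions = predictions.view(1, B, 5 + C, 7, 7).permute(0, 3, 4, 1, 2)
    print(predictions.shape)                      # torch.Size([1, 7, 7, 2, 25])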
Predictions on a Grid:
Multiple Objects on the Same Cell
The model will always produce a fixed number of N×N×B predictions for a
given image. We then filter the predictions to only consider bounding boxes
with a probability above some defined threshold.
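A sketch of that filtering, assuming the confidence score sits at index 4 of
each descriptor and a threshold of 0.5:

    import torch

    predictions = torch.randn(7, 7, 2, 25)        # N x N x B x (5+C), as sketched above
    boxes = predictions.reshape(-1, 25)           # all N*N*B = 98 candidate boxes
    confidence = torch.sigmoid(boxes[:, 4])       # assuming index 4 holds the confidence logit
    kept = boxes[confidence > 0.5]                # only likely detections survive
    print(kept.shape)                             # (number of boxes kept, 25)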
Predictions on a Grid:
Multiple Objects in Parallel
Multiple objects can be detected in parallel.
However, a large number of grid cells contain no object, which introduces a
large imbalance between predicted bounding boxes that contain an object
and those that do not.
Predictions on a Grid:
Non-maximum Suppression
The approach thus far produces a fixed number of bounding box predictions for
each image. BUT we would like to output bounding boxes only for objects that
are actually likely to be in the image. Non-maximum suppression (NMS) removes
duplicate detections of the same object by keeping only the highest-scoring
box among heavily overlapping predictions.
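A minimal pure-Python sketch of greedy non-maximum suppression (corner-format
boxes and the 0.5 IoU threshold are assumptions):

    def iou(box_a, box_b):
        """IoU of two boxes given as (x1, y1, x2, y2) corners."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        """Keep the best-scoring box, drop boxes that overlap it heavily, repeat."""
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
        return keep

    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))   # [0, 2]: the near-duplicate box 1 is suppressed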
Lin et al. (2017) explained why methods like SSD are less accurate than two-stage
methods and proposed to address the problem by rescaling the loss function. This
improvement, implemented in RetinaNet, means that single-shot methods are faster
and now as accurate as two-stage methods.
YOLO: You Only Look Once &
SSD: Single Shot MultiBox Detector
The idea is to divide the image into multiple grid cells, then change the labels
of the data such that both localization and classification are performed for
each grid cell.
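One plausible way to encode such labels as an N x N x (5+C) target tensor (the
conventions below are assumptions, not the exact YOLO or SSD recipe):

    import torch

    def encode_target(boxes, labels, C=20, N=7):
        """boxes: (cx, cy, w, h) in [0, 1] image-relative coordinates."""
        target = torch.zeros(N, N, 5 + C)
        for (cx, cy, w, h), cls in zip(boxes, labels):
            row, col = int(cy * N), int(cx * N)    # responsible cell
            target[row, col, 0] = cx * N - col     # centre offset within the cell
            target[row, col, 1] = cy * N - row
            target[row, col, 2] = w                # size relative to the image
            target[row, col, 3] = h
            target[row, col, 4] = 1.0              # an object is present here
            target[row, col, 5 + cls] = 1.0        # one-hot class label
        return target

    target = encode_target([(0.5, 0.5, 0.2, 0.3)], [7])
    print(target.shape, target[3, 3, 4])           # torch.Size([7, 7, 25]) tensor(1.)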
How to Get the Bounding Box
One of the things that may be difficult to understand at first is how the
detection system converts the cell predictions into an actual bounding box
that fits the object.
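A sketch of that conversion, mirroring the encoding above (x, y are offsets
within the cell; w, h are fractions of the whole image):

    def decode_cell(pred, row, col, N=7, image_size=224):
        """Turn one cell's (x, y, w, h) prediction into pixel corner coordinates."""
        x, y, w, h = pred[:4]
        cx = (col + x) / N * image_size            # absolute centre
        cy = (row + y) / N * image_size
        bw, bh = w * image_size, h * image_size    # absolute width/height
        return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

    print(decode_cell([0.5, 0.5, 0.25, 0.5], row=3, col=3))
    # (84.0, 56.0, 140.0, 168.0): a 56 x 112 pixel box centred on the middle cell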
Strategies to Define the Bounding Box
YOLO works similarly to SSD, with the difference that it uses fully
connected layers instead of only convolutional layers at the end of the
network. SSD seems superior.
The SSD Detector
An input image is passed through the truncated backbone network.
In this example, three more convolutions create three feature maps at
the top of the network with the shapes [256, 4, 4] (blue), [256, 2, 2]
(yellow), and finally [256, 1, 1] (green).
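One plausible stack of stride-2 convolutions producing exactly those shapes
from a 7x7x512 backbone output (a sketch, not the exact SSD configuration):

    import torch
    import torch.nn as nn

    extra1 = nn.Conv2d(512, 256, kernel_size=3, stride=2, padding=1)  # 7x7 -> 4x4 (blue)
    extra2 = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)  # 4x4 -> 2x2 (yellow)
    extra3 = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)  # 2x2 -> 1x1 (green)

    f0 = torch.randn(1, 512, 7, 7)
    f1 = extra1(f0)
    f2 = extra2(f1)
    f3 = extra3(f2)
    print(f1.shape, f2.shape, f3.shape)   # [1, 256, 4, 4] [1, 256, 2, 2] [1, 256, 1, 1]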
SSD Receptive Field in the Last Layer
The activations in the final layer have dependencies
on all activations in the previous layers and the
receptive field is thus the entire input image.
SSD Receptive Field in Other Layers
Note that the receptive field of an activation in the
yellow layer is only one quarter of the input image.
SSD Anchor Boxes
This leads to anchor or default boxes. Every default box needs n values that
represent the probabilities that a certain class was detected in that box, and
4 values that are no longer absolute coordinates of the predicted bounding box
but rather offsets relative to the respective default box.
The important idea is that, for every default box in the differently sized grids,
we do what we did when predicting one single object in an image.
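A sketch of applying predicted offsets to a default box using the common
centre-size parameterization (real SSD additionally divides the offsets by
fixed "variance" constants):

    import math

    def decode_offsets(default_box, offsets):
        """default_box: (cx, cy, w, h); offsets: (tx, ty, tw, th) from the network."""
        dcx, dcy, dw, dh = default_box
        tx, ty, tw, th = offsets
        cx = dcx + tx * dw                 # shift the centre, scaled by box size
        cy = dcy + ty * dh
        w = dw * math.exp(tw)              # scale width/height exponentially
        h = dh * math.exp(th)
        return cx, cy, w, h

    print(decode_offsets((0.5, 0.5, 0.2, 0.2), (0.1, -0.1, 0.0, 0.0)))
    # (0.52, 0.48, 0.2, 0.2): the centre shifts slightly, the size is unchanged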
SSD Summary
In Summary:
1. We defined several grids of differently sized default boxes that will allow us to detect
objects at different scales in one single forward pass.
2. For each default box in every grid, the network outputs n class probabilities and 4
offsets to the respective default box coordinates which give the predicted bounding
box coordinates.
Matching Bounding Boxes:
Compare Predicted with Default
We want to match the ground truth bounding box to a default box that is as
similar to it as possible. Two boxes are similar when they overlap as much
as possible while having as little non-overlapping area as possible.
This is quantified by the Jaccard index, or Intersection over Union (IoU):
IoU(A, B) = area(A ∩ B) / area(A ∪ B).
The idea is that we want to compare the ground truth bounding boxes in the training
example to predictions made by default boxes that are already very similar to the ground
truth bounding boxes.
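A minimal sketch of this matching, reusing the iou() helper from the
non-maximum suppression sketch above (corner-format boxes):

    def match_ground_truth(gt_boxes, default_boxes):
        """Match each ground-truth box to its best-overlapping default box.

        SSD additionally assigns every default box whose IoU with a ground
        truth exceeds 0.5, so one object may be matched to several defaults.
        """
        return {g: max(range(len(default_boxes)),
                       key=lambda d: iou(gt, default_boxes[d]))
                for g, gt in enumerate(gt_boxes)}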
One More Thing
More boxes of different sizes and aspect ratios = better object detection.
This detail is not too important for a general understanding of SSD, but
it is important for implementation.
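A sketch of generating the default boxes for one grid at several aspect ratios
(the scale and the ratios are assumptions; SSD chooses them per feature map):

    def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
        """Centre-format (cx, cy, w, h) default boxes, image-relative."""
        boxes = []
        for row in range(grid_size):
            for col in range(grid_size):
                cx, cy = (col + 0.5) / grid_size, (row + 0.5) / grid_size
                for ar in aspect_ratios:
                    boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
        return boxes

    # The 4x4, 2x2 and 1x1 grids with three ratios each give 48 + 12 + 3 boxes.
    print(len(default_boxes(4, 0.4)), len(default_boxes(2, 0.6)), len(default_boxes(1, 0.8)))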
Recall: methods like SSD or YOLO suffer from an extreme class imbalance: the
detectors evaluate roughly ten thousand to one hundred thousand candidate
locations (far more than the 4x4 + 2x2 + 1 default boxes in the previous example).
Cross Entropy Loss Function
Entropy is a measure of the uncertainty associated with a given distribution:
H(p) = -Σ_i p_i log p_i. The cross entropy between a target distribution p and
a predicted distribution q, CE(p, q) = -Σ_i p_i log q_i, reduces to
-log q_correct for a one-hot classification target.
Lin et al. (2017) scale the cross entropy loss so that the easy examples the
network is already very sure about contribute less to the loss, allowing the
learning to focus on the few interesting cases. Their focal loss multiplies the
cross entropy by the factor (1 - p)^gamma, where p is the predicted probability
of the correct class. Gamma = 2 seems to work best.
Final Notes on RetinaNet
With Focal Loss, when the network is pretty sure about a prediction, the
loss is now significantly lower.
In our previous example of 80% certainty, the cross entropy loss had a
value of ~0.22; with gamma = 2 the focal loss scales this by
(1 - 0.8)^2 = 0.04, giving a value of only ~0.009.
For predictions the network is not so sure about, the loss is reduced by a
much smaller factor.
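These numbers can be checked directly; a small sketch (omitting the alpha
class-balancing factor that RetinaNet also uses):

    import math

    def cross_entropy(p):
        """CE for a correct-class probability p (one-hot target)."""
        return -math.log(p)

    def focal_loss(p, gamma=2.0):
        """Focal loss: down-weight easy examples by (1 - p)^gamma."""
        return (1 - p) ** gamma * cross_entropy(p)

    print(cross_entropy(0.8))   # ~0.223: the easy example still costs this much
    print(focal_loss(0.8))      # ~0.009: almost ignored with gamma = 2
    print(cross_entropy(0.3))   # ~1.204: a hard example
    print(focal_loss(0.3))      # ~0.590: reduced by a much smaller factor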
With this powerful improvement, single forward-pass (one-stage) methods are
able to compete with two-stage methods regarding accuracy while easily beating
them with respect to speed.
This opens many new possibilities for accurate real-time object detection
even on embedded systems.