ABSTRACT Object detection remains one of the most fundamental and challenging tasks in computer vision and image understanding applications. Significant advances in object detection have been achieved through
improved object representation and the use of deep neural network models. This paper examines more closely
how object detection has evolved in the era of deep learning over the past years. We present a literature
review on various state-of-the-art object detection algorithms and the underlying concepts behind these
methods. We classify these methods into three main groups: anchor-based, anchor-free, and transformer-
based detectors. Those approaches are distinct in the way they identify objects in the image. We discuss
the insights behind these algorithms and experimental analyses to compare quality metrics, speed/accuracy
tradeoffs, and training methodologies. The survey compares the major convolutional neural networks for
object detection. It also covers the strengths and limitations of each object detector model and draws
significant conclusions. We provide simple graphical illustrations summarising the development of object
detection methods under deep learning. Finally, we identify promising directions for future research.
INDEX TERMS Object detection, deep learning, review, convolutional neural networks, transformers,
survey, neural networks.
beyond shallow and deep CNNs. For a better understanding of the dynamics and interactions between objects in these visual scenes, it is necessary to use sequential and relational information modeling to connect objects both in time and space. However, before introducing and clarifying these advanced techniques, it is worthwhile first to understand the evolution of state-of-the-art object detectors, their limitations, and how they can be addressed. This paper presents an in-depth review of several approaches for solving the object detection task. We explore and discuss the different frameworks used for object detection and the primary datasets and metrics applied to evaluate detection. We describe the advantages and limitations of the most widely used convolutional neural networks, which serve as backbones for the leading object detection models. Initially, we cover algorithms from the anchor-based family, including two-stage and one-stage object detectors. We then review more sophisticated and faster algorithms based on anchor-free and transformer-based object detection approaches. Next, we elaborate on each approach's strengths and weaknesses by comparing the methods mentioned in the paper. Finally, we provide a discussion of future directions and prospects.

A. COMPARISON WITH PREVIOUS REVIEWS
All previous studies [22], [35] were limited to an overview and comparison of a small number of object detection models, although other models were available at the time. Most previous surveys followed the same method of dividing the models into two categories: two-stage and one-stage detectors. Moreover, some focused on only one aspect of object detection. For example, some studied the detection of salient objects [26], [30], others the detection of small objects [33], [34], and others tiny objects [31]. In [32], the authors review the learning strategies of object detector models. In this paper, we try to cover all the detection models and approaches based on deep learning from 2013 to 2022, including the more recently published transformer-based object detection models. No previous work has comprehensively covered and analyzed the number of models we have listed. We also divide the detection models into four categories: the first concerns two-stage models based on anchors, the second relates to one-stage models based on anchors, the third refers to anchor-free methods, and the last category concerns transformer-based models.

B. OUR CONTRIBUTIONS
The primary motivation of this work is to provide a comprehensive, detailed, and simplified overview, through tables and figures, of the past and current state of the field of object detection. This paper can be a starting point for researchers and engineers seeking to gain knowledge in this field, especially those beginning their careers; they can learn about the current situation and contribute to advancing the field. Our contribution differs from previous ones in its focus and in the number of models mentioned and covered. Understanding any domain and developing new concepts necessitates knowledge of all existing concepts, including their pros and cons, particularly in a fast-developing field such as object detection. Our work therefore brings added value to the field and provides researchers, especially those starting in this field or those interested in applying these techniques in other disciplines, such as healthcare, with an up-to-date, state-of-the-art overview of object detection.
1) We propose an up-to-date survey that covers older and more recently published object detection models.
2) We present the first review that covers almost all object detection models based on deep learning.
3) We compare the different backbone networks used by object detectors in terms of their strengths, features, and limitations.
4) We propose a research study outlining and investigating generic object detection approaches from the perspective of anchors and transformers.
5) We summarise the evolution and categories of object detection with deep learning in simplified charts, diagrams, and tables.
6) We outline promising future directions in the field of object detection.

II. TRADITIONAL OBJECT DETECTION METHODS
The first notable strides in object detection and image recognition began in 2001, when Paul Viola and Michael Jones designed an effective face detection algorithm [36], a robust binary classifier built from multiple weak classifiers. Their demonstration of faces detected in real time on a webcam was, at the time, the most impressive illustration of computer vision. In 2005, a new paper by Navneet Dalal and Bill Triggs was published; their approach, based on the Histogram of Oriented Gradients (HOG) feature descriptor [37], outperformed existing pedestrian detection algorithms. In 2009, Felzenszwalb et al. developed the Deformable Part Model (DPM) [38], another crucial feature-based model. DPM proved highly successful in object detection applications in which bounding boxes were applied to localize objects, as well as in template matching and other well-known object detection approaches used at the time. Several methods had already been developed to extract patterns from images and detect objects [39], [42]. All traditional methods tend to involve three parts. 1) The first step consists in inspecting the entire image at multiple positions and scales to generate candidate boxes, using methods such as the sliding window [43], [44], max-margin object detection, or region proposals such as the selective search algorithm [45]. With sliding windows, it is usually necessary to capture several thousand windows per image, so any costly computation at this first level results in a prolonged process of scanning the entire image. During training in particular, several iterations over the training set are often necessary to include the selected ''hard'' negatives. 2) The second step, feature extraction, analyzes the generated regions to extract visual features or image patterns. With traditional object detection techniques, the design of these features is vital to the algorithm's performance. To do this, methods such as Haar-like features [46], HOG [37], the Scale-Invariant Feature Transform (SIFT) [47], Speeded Up Robust Features (SURF) [48], and Binary Robust Independent Elementary Features (BRIEF) [49] are applied. 3) Finally, the last step consists in classifying these regions, according to whether or not they contain an object, using classification algorithms such as Support Vector Machines (SVM) [50], AdaBoost [51], the Deformable Part-based Model (DPM) [46], and K-Nearest Neighbors [52]. Four essential elements determine how well any object detection framework performs: the feature set, the classifier, the learning method, and the training set. In particular, the traditional methods that have been most efficient in recent PASCAL VOC detection challenges [53] have used several feature channels combined with detectors that include multiple aspects and movable parts.
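To make the three-part pipeline concrete, the following is a minimal sketch of a sliding-window detector with HOG features and a linear SVM. It assumes a scikit-learn LinearSVC already trained on HOG features of fixed-size crops; the window size, stride, and score threshold are illustrative values rather than settings from any cited method.

```python
# A minimal sketch of the classic sliding-window + HOG + linear-SVM pipeline.
# `clf` is assumed to be a scikit-learn LinearSVC trained on HOG features of
# 64x128 crops; window size, stride, and threshold are illustrative only.
import numpy as np
from skimage.feature import hog


def detect_sliding_window(gray_image, clf, win=(64, 128), stride=16, thresh=1.0):
    """Scan the image, score every window with the SVM, keep the confident ones."""
    h, w = gray_image.shape
    detections = []  # (x, y, score) for windows classified as containing an object
    for y in range(0, h - win[1] + 1, stride):
        for x in range(0, w - win[0] + 1, stride):
            patch = gray_image[y:y + win[1], x:x + win[0]]
            # Step 2: hand-crafted feature extraction (HOG descriptor).
            feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2), feature_vector=True)
            # Step 3: classify the candidate window with a linear SVM.
            score = clf.decision_function(feat.reshape(1, -1))[0]
            if score > thresh:
                detections.append((x, y, float(score)))
    return detections
```

In practice the scan is repeated over an image pyramid to handle multiple scales, which is exactly the costly step criticized above.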
During 2008-2012, the gains obtained on PASCAL VOC with these traditional methods became marginal, with only minor improvements; this highlighted the shortcomings of traditional detectors and the need to develop more robust approaches. The issue with traditional techniques such as those mentioned above is that sliding windows, for instance, where a rectangle of various sizes slides across the entire image trying to locate appropriate objects, require a high level of computational effort and generate many duplicate windows, and the quality of the underlying classifier crucially influences the overall output. Traditional approaches to object detection were based on manually designing the features or the model according to our own understanding: we search for patterns and edges in filtered images, describe them as features, and classify them. Nevertheless, according to the most recent advances, it is more efficient to delegate such tasks to the computer so that it can learn them itself. Following the launch of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2010 [54], the classification error rate of this competition was approximately 26% in 2011. One year later, in 2012, the error rate dropped to 16.4% thanks to a convolutional neural network model called AlexNet [3], whose architecture is close to Yann LeCun's LeNet-5 [55]. This was a critical opening for convolutional neural networks: in the years since 2012, convolutional neural networks have won the battle, and the classification error rate on ILSVRC has been drastically reduced.

III. DATASETS AND EVALUATION METRICS
Several datasets are available to support object detection challenges, and each object detection model is evaluated on these challenges' datasets. These datasets vary from different perspectives regarding the number of images and objects per image, the number of labeled classes, and image size. Some key performance metrics have been implemented to measure the spatial position and the predicted classes' accuracy.

A. DATASETS
This paper compares the deep-learning-based object detection algorithms on the three most popular benchmark datasets: PASCAL VOC 2007, PASCAL VOC 2012, and Microsoft COCO. The ImageNet dataset was not used due to its huge size, which necessitates very high computing power for training.

1) PASCAL VOC
PASCAL Visual Object Classification (PASCAL VOC) 2007 and 2012 is a familiar and widely used dataset for object detection, with about 10,000 training and validation images annotated with objects and bounding boxes. There are 20 different categories in the PASCAL VOC dataset.

2) MS-COCO
The Common Objects in COntext (COCO) dataset was developed by Microsoft and is described in detail in [56]. The COCO training, validation, and test sets include over 200,000 images and 80 object categories.

3) ILSVRC
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [220] is also one of the most well-known datasets in the object detection field. It started in 2010 as an annual challenge for object detection evaluation and continued until 2017. The dataset is composed of 1,000 object classification classes, for a total of more than 1 million images, of which about half are dedicated to the detection task. There are about 200 object classes for the detection task.

4) OPEN IMAGES
Open Images [221] is a dataset introduced by Google under the Creative Commons Attribution license. It comprises about 9.2 million images labeled with unified ground truth, including segmentation masks. This database has about 600 object classes with almost 16 million bounding boxes and is considered one of the largest databases for object localization.

B. EVALUATION METRICS
To evaluate the performance of object detection models, scientists have implemented several metrics to make the evaluation and comparison between models more relevant and fairer. Several metrics, such as Intersection over Union (IoU), Frames Per Second (FPS), precision, recall, AUC, ROC, and PR curves, have been deployed. A primary metric that is often expected in the field of object detection is IoU, which measures detection quality by quantifying the overlap between the ground truth annotations and the predicted bounding boxes. Usually, an object detection model generates several bounding boxes for each detected object; through IoU and a chosen threshold, we can eliminate the bounding boxes that are less accurate. An IoU value close to 1 indicates that the detection is more accurate.

IoU = Area of intersection / Area of union

As mentioned, PASCAL VOC and MS-COCO are the reference datasets for testing and evaluating object detection models. Both challenges rely on mean average precision as the primary metric for evaluating object detector methods. However, there are still several differences in their definitions and implementations. An additional evaluation metric, mean average recall, is also applied in the MS-COCO Object Detection Challenge.
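The following is a minimal sketch of the IoU computation defined above, assuming boxes given as [x1, y1, x2, y2] corner coordinates; the box format and the example values are illustrative.

```python
# A minimal sketch of IoU for two axis-aligned boxes in [x1, y1, x2, y2] format.
import numpy as np


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # area of intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                     # area of union
    return inter / union if union > 0 else 0.0


print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175, roughly 0.14
```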
1) MEAN AVERAGE PRECISION
The mAP value is the mean of the average precision over all K classes. The average precision (AP) is derived from the precision-recall curve and is calculated over all unique recall levels. The method of computing AP in the PASCAL VOC challenge changed in 2010: the challenge now interpolates through all data points instead of only 11 equidistant points. mAP evaluates both the regression and classification accuracies.

2) MEAN AVERAGE RECALL
The mAR value is the mean of the average recall (AR) over all K classes. Like AP, the average recall is a numerical metric used to compare detector efficiency. AR is the mean recall over all IoU thresholds in the [0.5, 1] interval and can be calculated as twice the area under the recall-IoU curve.
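The following is a minimal sketch of AP with all-point interpolation, as in the post-2010 PASCAL VOC protocol described above. The inputs are assumptions for illustration: scores are detection confidences, matched flags whether each detection was matched to a ground-truth box (for example at IoU >= 0.5), and n_gt is the number of ground-truth objects of the class.

```python
# A minimal sketch of average precision (AP) with all-point interpolation.
import numpy as np


def average_precision(scores, matched, n_gt):
    order = np.argsort(-np.asarray(scores))          # rank detections by confidence
    tp = np.asarray(matched, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Interpolate: make precision monotonically non-increasing from right to left.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0]], precision))
    # Integrate precision over recall (area under the interpolated PR curve).
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```

mAP is then obtained by averaging this AP value over the K classes.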
TABLE 2. An overview of methods, datasets, and evaluation metrics.

A standard object detection model is divided into four main parts: the input, the backbone, the neck, and the head. The input can be a single image, a patch, or a pyramid of images. The backbone [57] can be a convolutional neural network such as VGG [58], ResNet [59], EfficientNet [60], SpineNet [61], or CSPDarkNet [62]. The neck is a network placed on top of the backbone; it is usually composed of several bottom-up and top-down paths, such as FPN [63], NAS-FPN [64], ASFF [65], PAN [66], and BiFPN [67], or of additional blocks such as SPP [68], RFB [69], and SAM [70]. As for the heads, they can be classified into two categories: those responsible for dense prediction, such as RetinaNet [71], YOLO [72], SSD [73], CornerNet [74], and FCOS [75], and those responsible for sparse prediction, such as Faster R-CNN [76], Mask R-CNN [77], and RepPoints [78].

IV. BACKBONE NETWORKS FOR OBJECT DETECTION
Regarding object detection and building a robust object detector model, one of the most important factors to consider is the design of the backbone network. The backbone for object detection is a convolutional neural network designed to provide the foundation for an object detector. Its primary purpose is to extract features from the images before submitting them to further steps, such as the localization phase in object detection. Several standard convolutional neural network backbones are used by object detectors, including VGGNets, ResNets, and EfficientNets, which are pre-trained for classification tasks.

A. ALEXNET
AlexNet [3] is a convolutional neural network (CNN) architecture developed in 2012. It consists of eight layers: five convolutional layers, two fully connected hidden layers, and one fully connected output layer with a 1000-way softmax classifier. AlexNet was the first CNN to win the ImageNet Large Scale Visual Recognition Challenge and is a leading architecture for object detection tasks. It uses ReLU activation functions and local response normalization layers.

B. VGGNETS
VGGNet [58] is a convolutional neural network architecture developed in 2014. It uses a deep architecture with multiple convolutional and fully connected layers: five blocks of convolutional layers followed by three fully connected layers. The VGGNet architecture is known for its use of small convolutional filters (3 × 3) and a very deep network with 16 to 19 layers. It uses ReLU activation functions and finishes with a softmax classifier. The main idea behind this architecture is to use very small filters (3 × 3) to capture fine details in the images and to stack multiple layers to increase the depth of the network; in this way, it can learn more complex features.

C. RESNETS
ResNet (Residual Network) [59] is an architecture designed and published in 2015. It is known for its ability to train profound networks without the problem of vanishing gradients, which is a common issue in very deep networks. The original ResNet paper proposed five sizes of the model: 18, 34, 50, 101, and 152 layers. Since then, many other variants of ResNet have been developed, such as ResNeXt and Wide Residual Networks (WRN). For example, ResNet-34 uses a plain network architecture inspired by VGG-19, to which shortcut connections are added; these shortcut connections allow the model to skip layers without affecting performance. The critical innovation of ResNet is the introduction of residual connections, which allow the network to learn the residual mapping between the input and the output of a layer rather than the original mapping. The residual connections allow the network to propagate gradients more easily and enable the training of much deeper networks. The ResNet architecture uses a building block called the ''Residual Block,'' which contains multiple convolutional and batch normalization layers. The final layer is connected to a fully connected layer to classify the images.

D. INCEPTION-RESNET
Inception-ResNet [228] is a convolutional neural architecture that builds on the Inception family of architectures developed by Google in 2016 but incorporates residual connections similar to the ResNet architecture to improve the flow of gradients and allow for the training of deeper networks. The Inception architecture is known for its use of multiple parallel convolutional and pooling layers, called ''Inception modules,'' which extract features at different scales and then concatenate them before passing them to the next layer. Inception-ResNet is 164 layers deep and was trained on over a million images from the ImageNet database. The final layers are connected to a fully connected layer to classify the images. The network has an architecture schema similar to Inception-v4, but they differ in their stems and in their Inception and residual blocks. The model has achieved excellent performance at a relatively low computational cost.
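To make the residual-connection idea in the two previous subsections concrete, the following is a minimal PyTorch-style sketch of a basic residual block (two 3 × 3 convolutions with batch normalization and an identity shortcut). The channel counts and the absence of a projection shortcut are simplifying assumptions, not the exact block of any specific paper.

```python
# A minimal sketch of a basic residual block: the block learns a residual
# mapping F(x) and outputs F(x) + x through the shortcut connection, which is
# what lets gradients flow through very deep networks.
import torch
from torch import nn


class BasicResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                                  # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))      # first conv-BN-ReLU
        out = self.bn2(self.conv2(out))               # second conv-BN
        return self.relu(out + residual)              # add the shortcut, then ReLU
```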
E. EFFICIENTNETS
EfficientNet [60] is a convolutional neural network and scaling method published in 2019 that uniformly scales all dimensions of depth, width, and resolution using a compound scaling approach. This allows the network to better balance accuracy and computational efficiency. It uses a building block called the mobile inverted bottleneck convolution (MBConv), which combines depthwise and pointwise convolutional layers. It is similar to MobileNetV2 and MnasNet but is slightly larger due to an increased FLOP budget. The final layers are connected to a fully connected layer to classify the images. EfficientNet-B0 is the base model, with an overall structure similar to other architectures such as ResNet and VGG; as the model number increases, up to EfficientNet-B7, the architecture becomes more complex, with more layers, more filters, and higher-resolution input images.

F. GOOGLENET
GoogLeNet [120], also known as Inception v1, is a convolutional neural network architecture based on the Inception architecture that Google developed in 2014. It utilizes Inception modules, allowing the network to choose the best filters for a given input. GoogLeNet is 22 layers deep, with 27 pooling layers, and consists of 9 Inception blocks arranged into three groups with max-pooling in between; the Inception modules extract features at different scales and then concatenate them before passing them to the next layer, with global average pooling at the end. The GoogLeNet architecture won the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competition.

G. CSPRESNEXT
CSPResNeXt [62] is a convolutional neural network in which the Cross Stage Partial Network (CSPNet) approach is applied to ResNeXt. CSPNet uses cross-stage partial connections to bypass some of the network's layers, improve the flow of gradients, and allow for the training of deeper networks. ResNeXt (Residual Network with extreme cardinality) is an architecture that uses a building block called the ''ResNeXt Block,'' which contains multiple branches of convolutional layers with different numbers of filters, allowing the network to learn features at different scales and increasing its capacity. CSPResNeXt is used as a feature extractor in YOLOv4; it partitions the feature map into multiple stages, allowing for better learning capability in CNNs.

H. DENSENETS
DenseNet [229] is a network that uses dense connections between layers through Dense Blocks, which contain multiple convolutional layers, batch normalization, and ReLU activation. The final layers are connected to a fully connected layer for image classification. The architecture is characterized by its use of dense connections, which connect each layer to every other layer in a feed-forward fashion and draw representational power from feature reuse instead of from extremely deep or wide architectures. Because each layer is connected directly with every other layer in the network, this dense connectivity pattern allows the network to propagate gradients more efficiently and effectively, enabling the training of deeper networks. The architecture also allows for a significant reduction in the number of parameters compared to traditional architectures.

I. SENET
SENet [230] (Squeeze-and-Excitation Network) is a convolutional neural network architecture published in 2017. The architecture employs squeeze-and-excitation blocks to enable the network to perform dynamic channel-wise feature recalibration, which improves the feature representation capabilities of CNNs. The architecture uses a building block called the ''SE block,'' which contains two sub-layers: a ''Squeeze'' layer, which reduces the dimensionality of the feature maps, and an ''Excitation'' layer, which adaptively recalibrates the feature maps. The Squeeze layer applies global average pooling to the feature maps to obtain a channel descriptor, which is then passed through a fully connected layer, also called a bottleneck layer, to reduce the dimensionality of the descriptor.

J. HOURGLASS
The Hourglass architecture [231] is a convolutional neural network (CNN) used for human pose estimation, object detection, and semantic segmentation tasks. The architecture is characterized by its repeated bottom-up and top-down processing, similar to an hourglass shape, which allows the network to learn both fine and coarse features of the input. The Hourglass architecture consists of several modules stacked on top of each other. Each hourglass module is a sub-network consisting of convolutional and pooling layers at the top, followed by up-sampling and convolutional layers at the bottom that reconstruct the input feature maps. These modules are connected in a ''skip'' or ''residual'' connection fashion, allowing information to flow from one module to the next.

K. SPINENET
SpineNet [61] is a convolutional neural network backbone with scale-permuted intermediate features and cross-scale connections, learned on an object detection task and developed by Google AI in 2020. A conventional backbone typically encodes an input image into a series of intermediate features with decreasing resolutions. The architecture of SpineNet is instead based on the idea of a ''sparse backbone,'' composed of a sequence of sparse convolutional layers, called ''spine layers,'' that are interleaved with dense layers called ''non-spine layers.'' The spine layers are lightweight, with fewer parameters and computations.
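The following is a minimal PyTorch-style sketch of the SE block described in the SENet subsection above: global average pooling produces a channel descriptor, a small two-layer bottleneck produces per-channel weights, and the feature maps are rescaled. The reduction ratio of 16 is a commonly used value chosen here only for illustration.

```python
# A minimal sketch of a squeeze-and-excitation (SE) block: "squeeze" with global
# average pooling, "excite" with a small bottleneck MLP, then rescale channels.
import torch
from torch import nn


class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: channel recalibration weights
        return x * w                      # rescale the original feature maps
```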
It returns a unique collection of predictions for each set of anchor boxes. Generating bounding boxes can be described as follows. (1) Create thousands of candidate anchor boxes that best describe the objects' size, location, and shape. (2) Predict the offset for each bounding box. (3) Compute a loss function for each anchor box based on the ground truth. (4) For each anchor box, compute the Intersection over Union (IoU) to check which object's bounding box has the largest IoU. (5) When this value is greater than 0.5, notify the anchor box that it should detect the object with the highest IoU and factor the prediction into the loss function. (6) If the value is marginally less than 0.5, we instruct the anchor box not to learn from this sample, since the prediction is ambiguous; if it is remarkably less than 0.5, the anchor box is likely to predict that no object is present. By using this process, we ensure that the model learns to identify only true objects. Using anchor boxes allows a network to detect multiple objects, objects of different scales, and overlapping objects. In object detection, anchor-based detectors define anchor boxes at each position in the feature map; the network predicts the probability of an object in each anchor box and then adjusts the size of the anchor boxes to fit the object. However, anchors require careful design and application in object detection frameworks. (a) The coverage ratio of the instances' location space is among the most critical factors in anchor design: to ensure a good recall rate, anchors are thoroughly engineered based on statistics computed from the training/validation set [79], [80]. (b) Some design choices based on a particular dataset may not apply to other applications, which affects generality [81]. (c) During the learning phase, anchor-based approaches rely on intersection over union (IoU) to define the positive/negative samples, thus adding extra computation and hyper-parameters to the object detection system [82]. Anchor-based object detection frameworks generally fall into two groups: 1) two-stage, proposal-based detectors, and 2) one-stage, proposal-free methods. The anchors serve as regression references and classification candidates for predicting proposals (for two-stage detectors) and final bounding boxes (for one-stage detectors).
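The following is a minimal sketch of steps (1) and (4)-(6) above: generating anchor boxes over a feature-map grid and labeling them positive, negative, or ignored according to their best IoU with the ground truth. The scales, aspect ratios, stride, and the 0.5/0.4 thresholds are illustrative assumptions, not the settings of any particular detector; iou() is the helper from the earlier sketch.

```python
# A minimal sketch of anchor generation and IoU-based sample assignment.
import numpy as np


def make_anchors(fmap_h, fmap_w, stride=16, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Step (1): one set of anchor boxes centred at every feature-map location."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)


def assign_anchors(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Steps (4)-(6): label each anchor by its best IoU with the ground truth.
    Returns +1 (positive), 0 (negative/background) or -1 (ignored, ambiguous)."""
    labels = np.full(len(anchors), -1, dtype=int)
    matched_gt = np.zeros(len(anchors), dtype=int)
    for i, a in enumerate(anchors):
        ious = [iou(a, g) for g in gt_boxes]
        best = int(np.argmax(ious))
        if ious[best] >= pos_thresh:
            labels[i], matched_gt[i] = 1, best   # learn to detect this object
        elif ious[best] < neg_thresh:
            labels[i] = 0                        # confident background sample
    return labels, matched_gt
```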
A. TWO-STAGE METHODS
Region-based object detection algorithms were among the most widely used techniques for detecting objects in images. The first models in object detection start intuitively by searching for regions and then performing the classification. The two-stage methods are derived from the R-CNN family, which extracts RoIs using a selective search method [83] and a region-based prediction network (R-CNN) [84], [85] to detect objects, and then classifies and regresses them. Many models were subsequently introduced to improve its performance. For example, using bilinear interpolation, Mask R-CNN [77] replaces the RoIPool layer with the RoIAlign layer. Other models target different aspects to improve performance: some revise the whole architecture, such as [86] and [89], some use multi-scale learning and testing [90], [91], others rely on feature fusion and enhancement [63], [92] or on new loss functions and training schemes [93], [95], and some employ better proposal generation and sample balancing [96], [97]. In contrast, others apply context and attention mechanisms, and specific models employ different learning strategies and loss functions. Faster R-CNN [76] is the most well-known two-stage anchor-based detector reference; it uses a separate region proposal network (RPN) that generates RoIs by modifying predefined anchor boxes.

1) REGION-BASED CONVOLUTIONAL NEURAL NETWORKS (R-CNN)
Instead of processing many regions, the R-CNN [85] model developed by R. Girshick et al. in 2014 proposes a number of boxes in the image and checks whether any of them contains an object. R-CNN relies on the selective search method [83] developed by J. R. R. Uijlings et al. in 2012, a variant of exhaustive image search, to extract these boxes from an image; these boxes are called regions. Selective search is used to extract 2000 region proposals; these candidate region proposals are cropped and resized to fit the input of the CNN feature extractor, which extracts a 4096-dimensional feature vector that is transmitted to several classifiers for class prediction. SVMs [50] are assigned to each class to classify the object's occurrence in the proposed candidate region given this feature vector. R-CNN also has a linear regressor that predicts four offset values to enhance the accuracy of the selected bounding boxes and minimize localization errors. R-CNN thus consists of three simple steps: scan the input image for objects that may be present using the selective search algorithm, proposing approximately 2000 candidate boxes; then, for every candidate box, apply a CNN for feature extraction; finally, pass the result of each CNN to an SVM for classification and to a linear regressor to refine the object's bounding box. R-CNN is very easy to use but very slow.

2) SPATIAL PYRAMID POOLING NETWORK (SPPNet)
Based on the concept of spatial pyramid matching (SPM) [98], SPPNet [68] is mainly an improved version of R-CNN [85]. SPPNet implements a specific CNN procedure known as Spatial Pyramid Pooling (SPP) between the convolutional layers and the fully connected layers. In transitioning from the convolutional layers to the fully connected layers, it proposes having multiple pooling layers at different scales instead of the single pooling layer often used as a standard in other methods. A selective search algorithm is applied by SPPNet to generate about 2,000 region proposals per image, just like R-CNN. Next, it extracts features directly from the whole image using ZFNet [99], only once. At the final convolutional layer, the feature maps delineated by each region proposal pass through the SPP layer, followed by the fully connected layers. Each bounding box has its own SVM and bounding box regressor. SPPNet uses the SPP for every region proposal to pool the features of that region from the global feature volume and produce a fixed-length representation. SPP solves the problem of cropping the image to a fixed size before entering the CNN, as in R-CNN with VGG [58], where image sizes are fixed (224 × 224); with SPP, images of different shapes can be used. In contrast to R-CNN, SPPNet passes the image through the convolutional layers only once, whereas R-CNN processes the image through the convolutional layers at least 2000 times. As shown in Table 4, SPPNet is much faster and more accurate than R-CNN.

3) FAST REGION-BASED CONVOLUTIONAL NETWORK (FAST R-CNN)
To solve some of the problems of R-CNN and SPPNet and to develop a faster object detection algorithm, Ross Girshick published a new paper named Fast R-CNN [84]. Comparing Fast R-CNN with SPPNet, one can observe that the SVM classifiers have been removed and that a regression and classification layer has been connected to the network; VGGNet is used instead of ZFNet, and a region of interest (RoI) pooling layer replaces the SPP layer. On the other hand, Fast R-CNN is similar to the original R-CNN in many ways. However, two major additions have improved its detection speed. First, the image features are extracted before proposing regions, rather than forwarding the region proposals to the feature extractor: a single CNN is applied to the entire image rather than 2000 CNNs on 2000 regions. The primary CNN, with several convolutional layers, takes the entire image as input rather than applying a CNN to every region proposal, so the region proposals are based on the last feature map; regions of interest (RoIs) are detected by applying the selective search method to the feature maps produced, and each proposal region is then resized by an RoI pooling layer to obtain a valid region of interest that can be fed into a fully connected layer. Second, the SVM is replaced by a softmax layer, extending the neural network for prediction rather than building a new model: Fast R-CNN uses a softmax layer instead of many different SVMs to predict the class directly for each region proposal, together with the offset values for the bounding boxes. Therefore, there is only one neural network to train, compared to one neural network plus many SVMs. Fast R-CNN uses a multi-task loss function that combines classification and regression losses: the classification loss is computed using the log loss, and the regression loss is computed using the smooth L1 loss.

4) FASTER REGION-BASED CONVOLUTIONAL NETWORK (FASTER R-CNN)
The three algorithms mentioned above, R-CNN, SPPNet, and Fast R-CNN, rely on selective search to identify region proposals. Selective search is a slow and time-consuming method that impacts network performance and was proven to be the bottleneck of the entire process. Thus, the authors of Faster R-CNN [76] proposed a framework for object detection that replaces the selective search algorithm and allows the network itself to discover region proposals. The key point of Faster R-CNN is that the region proposals can depend on image features that are already calculated during the CNN forward pass (the first step of classification). The authors developed a region proposal network (RPN) [76] to generate region proposals directly and then predict bounding boxes: Faster R-CNN combines an RPN with a Fast R-CNN model. Faster R-CNN takes the CNN feature maps and forwards them to the region proposal network. The RPN utilizes a 3 × 3 sliding window that moves across these feature maps; each sliding-window location generates multiple potential regions and scores based on k fixed-ratio anchor boxes. The resulting bounding boxes, in various shapes and sizes, are passed to the RoI pooling layer. Since region proposals may have no classes assigned to them after the RPN step, each proposal is cropped so that each proposal region contains an object; that is what the RoI pooling layer is for. It extracts fixed-size feature maps for each anchor. These feature maps are then transmitted to a fully connected layer comprising a softmax and a linear regression layer, which classifies the objects and predicts the bounding boxes for the detected objects. Only one CNN is used in Faster R-CNN for both region proposals and classification, and Faster R-CNN is optimized for a multi-task loss function comprising classification and regression losses. The Region Proposal Network (RPN) is a convolutional neural network that proposes regions, while the second network is a Fast R-CNN that extracts features and outputs the bounding boxes and class labels. The RPN is optimized for the given multi-task loss function.
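The following is a minimal sketch of the RoI pooling idea used by Fast and Faster R-CNN: each proposal is projected onto the feature map and max-pooled into a fixed-size grid. The 7 × 7 output size and the simple rounding scheme are illustrative simplifications; RoIAlign, mentioned above for Mask R-CNN, replaces the rounding with bilinear interpolation.

```python
# A minimal sketch of RoI pooling on a single feature map: project each box to
# feature-map coordinates, split it into an out_size x out_size grid, and take
# the maximum inside every cell.
import numpy as np


def roi_pool(feature_map, boxes, stride=16, out_size=7):
    """feature_map: (C, H, W); boxes: list of [x1, y1, x2, y2] in image pixels."""
    c, fh, fw = feature_map.shape
    outputs = []
    for x1, y1, x2, y2 in boxes:
        # Project the proposal onto the feature map (integer rounding).
        fx1, fy1 = int(x1 / stride), int(y1 / stride)
        fx2 = min(max(fx1 + 1, int(np.ceil(x2 / stride))), fw)
        fy2 = min(max(fy1 + 1, int(np.ceil(y2 / stride))), fh)
        region = feature_map[:, fy1:fy2, fx1:fx2]
        pooled = np.zeros((c, out_size, out_size), dtype=feature_map.dtype)
        ys = np.linspace(0, region.shape[1], out_size + 1).astype(int)
        xs = np.linspace(0, region.shape[2], out_size + 1).astype(int)
        for i in range(out_size):
            for j in range(out_size):
                cell = region[:, ys[i]:max(ys[i] + 1, ys[i + 1]),
                                 xs[j]:max(xs[j] + 1, xs[j + 1])]
                pooled[:, i, j] = cell.max(axis=(1, 2))   # max-pool each bin
        outputs.append(pooled)
    return np.stack(outputs)  # (num_boxes, C, out_size, out_size)
```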
5) REGION-BASED FULLY CONVOLUTIONAL NETWORK (R-FCN)
R-CNN-based detectors, such as Fast R-CNN or Faster R-CNN, detect objects in two phases: first they generate region proposals (RoIs), then they classify and localize objects from the RoIs. These detectors save valuable time by sharing the computation of the convolutional features used for object classification and region proposals. However, Faster R-CNN still contains several unshared fully connected layers inherited from R-CNN that must be computed for each of the hundreds of proposals. The Region-based Fully Convolutional Network (R-FCN) [100] is a framework that combines the two main phases in a single model to take into account both the detection of the object and its position simultaneously. It contains only convolutional layers, which allows full backpropagation for training and inference. As observed in the methods mentioned above, region proposals are mainly generated by an RPN, RoI pooling is performed, and the result passes through fully connected (FC) layers to classify and regress the bounding boxes. The post-RoI-pooling computation is not shared between RoIs and takes a long time, and the FC layers add more parameters to the model, which leads to more complexity. In R-FCN, there is still an RPN for region proposals; however, unlike the R-CNN series, the FC layers after RoI pooling are eliminated. Instead, almost all the computation is moved before RoI pooling to generate feature maps, each dedicated to detecting a category at a specific location: for instance, one feature map is dedicated to detecting a dog, another to detecting a car, etc. These feature maps depend on the object's spatial localization and are called position-sensitive score maps. After RoI pooling, all the region proposals use the same score maps to carry out average voting, a simple computation, so there is no learnable layer after the RoI layer; in other words, R-FCN is significantly faster than Faster R-CNN while keeping a highly competitive mAP. R-FCN operates as follows: the input image is processed by the ResNet-101 backbone [59] to generate feature maps. These feature maps are transmitted, on the one hand, to an RPN to produce RoIs and, on the other hand, to a fully convolutional layer that generates a bank of k² (C + 1) position-sensitive score maps, where k² is the number of relative positions used to split an object into a grid and C + 1 is the number of classes plus the background. Afterward, each RoI is split into the same k² bins or sub-regions as the score maps, and for each bin the score bank is checked for the map that corresponds to the respective position of the object: in the upper-left bin, for instance, we look for the score maps that match the upper-left corner of an object and average their values over that RoI sub-region. The system performs this procedure for each class. After each of the k² bins has a corresponding value for each class, the k² bins are averaged to produce a single score per class, and the RoI is classified with a softmax over the resulting (C + 1)-dimensional vector. For bounding box regression, additional convolutional filters generate 4 × k² position-sensitive maps from the same feature maps, alongside the k² (C + 1) score maps used for classification. The loss function of R-FCN is defined on each RoI and is the sum of the cross-entropy loss and the box regression loss. The classification loss (Lcls) and bounding box regression loss (Lreg) are used with online hard example mining (OHEM).

6) FEATURE PYRAMID NETWORKS (FPN)
The FPN [63] is not an object detector in itself; it is a feature extractor that operates in combination with object detectors. For instance, with FPN, we can extract multiple feature map layers and feed them into an RPN to detect objects. Compared to the feature extractors used in frameworks such as Faster R-CNN, FPN generates more layers of feature maps, multi-scale feature maps, and higher-quality information than the standard feature pyramid used for object detection. Using FPN allows us to detect objects at various scales. The FPN consists of a bottom-up and a top-down pathway. The bottom-up pathway is the traditional convolutional network for feature extraction and uses a ResNet [59]; the spatial resolution decreases as we move upwards, and as more high-level features are detected, the semantic value of each layer is enhanced. The output of the last layer of each stage is used as a reference set of feature maps to enhance the top-down pathway through lateral connections. The top-down pathway produces higher-resolution layers from a semantically rich layer; however, while the reconstructed layers are semantically strong, the locations of objects become imprecise after all the down-sampling and up-sampling. The authors add lateral connections between the reconstructed layers and the corresponding feature maps to address this problem and predict locations more accurately. In the top-down pathway, the spatial resolution is upsampled by a factor of 2 using nearest-neighbor interpolation to simplify the process. For each lateral connection, feature maps of the same spatial size from the bottom-up and top-down pathways are merged: the feature maps of the bottom-up pathway first pass through a 1 × 1 convolution to reduce the channel dimension, and the bottom-up and top-down feature maps are combined by element-wise addition. Then, a 3 × 3 convolution is applied to each merged map to compute the final feature map, which is designed to reduce the aliasing effect of upsampling. The final set of feature maps is called P2, P3, P4, P5, corresponding to C2, C3, C4, C5, which have the same spatial sizes, respectively.
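The following is a minimal PyTorch-style sketch of the FPN top-down pathway described above: 1 × 1 lateral convolutions, nearest-neighbor upsampling by a factor of 2, element-wise addition, and a 3 × 3 smoothing convolution per output level. The channel counts are illustrative, and each level is assumed to be exactly twice the resolution of the next.

```python
# A minimal sketch of the FPN top-down pathway producing P2..P5 from C2..C5.
import torch
from torch import nn
import torch.nn.functional as F


class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # 1x1 lateral convolutions reduce every level to the same channel size.
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down: upsample the coarser map by 2 and add it element-wise.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(laterals[i + 1], scale_factor=2,
                                                      mode="nearest")
        # 3x3 convolutions reduce the aliasing introduced by upsampling.
        p2, p3, p4, p5 = [s(x) for s, x in zip(self.smooth, laterals)]
        return p2, p3, p4, p5
```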
7) PANET
The Path Aggregation Network (PANet) [66] is a method developed mainly for instance segmentation; it adds a bottom-up path aggregation network on top of FPN. It provides adaptive feature pooling, which shortens the distance between the lowest and topmost feature levels by pooling the features of all feature levels and avoiding arbitrarily assigned outputs; PANet thus allows the network to decide which features are useful. It uses a complementary path to enhance the features of each proposal by providing accurate localization signals from the lower layers, generating a bottom-up augmentation. PANet obtained an accuracy of 41.4% on the MS-COCO dataset, compared to Mask R-CNN, which achieved only 36.4%; PANet uses ResNeXt-101 as a backbone.

8) TRIDENTNET
The TridentNet model [89] proposes an approach to deal with scale variations in object detection by generating in-network scale-specific feature maps with uniform representational power. It builds a parallel multi-branch architecture and applies scale-aware training, where each branch shares the same transformation parameters but has a different receptive field. The model applies a fast inference method with only one major branch to boost model performance without using additional parameters and computations. The TridentNet authors achieved an mAP of 48.4% on the MS-COCO dataset with ResNet-101 as a backbone.

9) SPINENET
SpineNet [61] is a classification and object detection model that uses Neural Architecture Search for learning, in contrast to traditional encoder-decoder architectures whose scale-decreased backbones are ineffective at generating multi-scale features. The proposed SpineNet method has a fixed stem network followed by scale-permuted intermediate features and cross-scale connections. The authors proposed many variants of SpineNet, such as SpineNet-49, SpineNet-143, and SpineNet-190; the latter obtained an AP of 52.2% on the MS-COCO dataset.

10) COPY-PASTE
In [101], the authors applied the copy-paste data augmentation strategy and proved its effectiveness for object detection and instance segmentation. The copy-paste method chooses two images randomly and applies random scale jittering and a horizontal flip; it then generates new data by pasting objects from one image into the other. In the final stage, the ground truth annotations for the bounding boxes are adjusted by eliminating occluded objects. The authors also propose a self-training copy-paste scheme in which a supervised model is trained on labeled data and produces pseudo labels on unlabeled data. Combined with Cascade Eff-B7 NAS-FPN, they achieved an AP of 55.9% on the MS-COCO dataset.

B. COMPARISON: TWO-STAGE DETECTORS
Table 4 lists a chronological comparison of the strengths and limitations of the two-stage anchor-based detection methods mentioned earlier in this paper.

C. ONE-STAGE METHODS
One-stage anchor-based detectors are characterized primarily by their computational and runtime efficiency. These models directly classify and regress predefined anchor boxes instead of using regions of interest. SSD was the first well-known object detector of this category [73]. The major challenge encountered in this type of detector is the imbalance between positive and negative samples. Several approaches and mechanisms have been implemented to overcome this problem, such as anchor refinement and matching [102], training from scratch [103], [104], multi-layer context information fusion [105], [107], and feature enrichment and alignment [69], [108], [111]. Other works have been directed toward developing new loss functions [79], [112] and new architectures [113], [114].
1) YOLOv2
YOLOv2, or YOLO9000 [80], published in 2017, is an object detection model capable of detecting more than 9,000 object categories in real time. It has many updated features that fix the problems of the first version. The main improvements in YOLOv2 compared to YOLOv1 [72] are the application of batch normalization over all the convolutional layers and, besides training on 224 × 224 images, the use of 448 × 448 images to fine-tune the classification network for ten epochs on ImageNet [115]. Using 416 × 416 images during training eliminates a pooling layer for better output resolution, and all fully connected layers are removed and replaced with anchor boxes for predicting bounding boxes. The model achieved 69.2% mAP and 88% recall with the anchor boxes; without them, it achieved 69.5% mAP and 81% recall. Although the mAP is slightly reduced, the recall increases by a large margin. As with Faster R-CNN [76], the anchor box sizes and scales were pre-set beforehand. YOLO9000 relies on k-means clustering to obtain better IoU scores, because standard Euclidean-distance-based k-means produces additional errors when dealing with larger boxes. Using an IoU-based clustering approach with nine anchor boxes, Faster R-CNN obtained 60.9%, whereas YOLO9000 achieved 67.2%. In YOLOv2, the location is constrained by a logistic activation, which keeps the value between 0 and 1, compared to YOLOv1, which has no constraints on the location prediction. YOLOv2 predicts multiple bounding boxes per grid cell; to compute the loss for the true positive, only one of them should be responsible for the object, and for this purpose the one with the highest IoU (intersection over union) with the ground truth is selected. The YOLOv2 loss function has three parts: bounding-box coordinate prediction, bounding-box score prediction, and class-score prediction. All of them are mean-squared-error losses, modulated by scalar meta-parameters or by the IoU score between the prediction and the ground truth.

2) YOLOv3
The YOLO [72] algorithm uses a softmax function to convert the scores into probabilities that sum to one. YOLOv3 [116] applies multi-label classification instead, and the softmax layer is substituted with independent logistic classifiers that calculate the probability of the input belonging to a particular label. Rather than applying the mean squared error to compute the classification loss, YOLOv3 applies a binary cross-entropy loss for every label; it also reduces the computational cost by bypassing the softmax function. It provides additional minor enhancements. It performs prediction at three scales, obtained by downsampling the input image dimensions by 32, 16, and 8, respectively. Darknet, in this version, has been extended to include 53 convolutional layers. Detection at several layers is a good solution to the problem of small object detection, a common concern with YOLOv2. YOLOv3 uses a total of 9 anchor boxes, three per scale, and relies on k-means clustering to generate all nine anchors. The anchors are then sorted in descending order of size: the first scale is allocated the three largest anchors, the second the following three, and the third the last three. YOLOv3 predicts more bounding boxes than YOLOv2: for the same 416 × 416 image, YOLOv2 predicts 13 × 13 × 5 = 845 boxes (five boxes at every grid cell, using 5 anchors), whereas YOLOv3 predicts boxes at three distinct scales, for a total of 10,647 predicted boxes for an image of size 416 × 416; in other words, it predicts roughly ten times more boxes than YOLOv2. For each scale, every grid cell predicts three boxes using three anchors; since there are three scales, nine anchor boxes are used in total. The loss function of YOLOv3 is defined from three aspects: the bounding box position error, the bounding box confidence error, and the classification prediction error between the ground truth and the predicted boxes. YOLOv3 predicts an objectness score for each bounding box using logistic regression. The first aspect, the bounding box position error, is calculated by summing the squared differences between the predicted and true values of a bounding box's x, y, w, and h coordinates, multiplied by a lambda coefficient that controls its importance relative to the other losses. The second aspect, the bounding box confidence error, measures how confident YOLOv3 is that there is an object in a given bounding box; this term uses a binary cross-entropy loss to measure how well the model predicts whether or not there is an object in a given cell. Finally, the classification prediction error measures how well YOLOv3 predicts an object's class; it uses a binary cross-entropy loss for each label.
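The following is a minimal sketch of the IoU-based k-means anchor clustering that YOLOv2 and YOLOv3 use to choose anchor shapes, with distance d = 1 - IoU between box shapes (widths and heights only, centres aligned). The value k = 9, the initialisation, and the stopping rule are illustrative assumptions.

```python
# A minimal sketch of k-means anchor clustering with d = 1 - IoU, applied to the
# widths/heights of the training-set boxes.
import numpy as np


def wh_iou(wh, centroids):
    """IoU between box shapes (N, 2) and centroid shapes (k, 2), centres aligned."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union


def kmeans_anchors(wh, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]   # random initial anchors
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centroids), axis=1)   # nearest = highest IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]    # anchors sorted by area
```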
3) SSD
The Single Shot MultiBox Detector (SSD) [73] is an object detection framework published after R-CNN and YOLO. It was developed by W. Liu et al. to predict bounding boxes and class probabilities in a single pass using an end-to-end CNN architecture, and it is typically faster than Faster R-CNN. SSD detects several objects in the image in a single shot, instead of the two shots required by the region proposal network methods listed in the previous section; as a consequence, SSD is considerably more time-efficient than region-based approaches. An image is passed as input through a VGG-16 [58] network to extract feature maps. Several convolutional layers are then added with different feature-map sizes (19 × 19, 10 × 10, 5 × 5, 3 × 3, and 1 × 1). These, together with the 38 × 38 feature map produced by conv4_3 of VGG, are the feature maps that 3 × 3 convolution filters process for each cell to make predictions. There are k bounding boxes for each location in the feature maps; these k boxes have various sizes and aspect ratios. For each bounding box, we compute the C class scores and four offsets relative to the original default box shape, so each box has four parameters and a probability vector corresponding to the confidence assigned to each object class. SSD uses negative sampling to account for poor predictions: it applies non-maximum suppression at the end of the model, like YOLO, to keep the most appropriate boxes. Afterward, Hard Negative Mining (HNM) is applied to ensure faster and more stable training: the negative examples are sorted according to the highest confidence value assigned to each default box, and the highest-scoring ones are selected so that the negative-to-positive ratio is at most 3:1.
The SSD loss function combines a localization loss and a confidence loss. The localization loss is the mismatch between the ground truth box and the predicted boundary box; SSD only penalizes predictions from positive matches, and negative matches can be ignored. The confidence loss is a softmax loss over the multiple class confidences (c). During training, the choice of default boxes and scales for detection is essential. SSD uses the smooth L1 loss as its regression loss, a particular case of the Huber loss with δ = 1; smooth L1 combines the L1 and L2 losses, behaving like an L2 loss when |a| is less than or equal to 1. One-hot encoding turns the label y into a probability distribution.

4) RETINANET
RetinaNet [79] is a single-stage object detector, like SSD and YOLO, that offers almost the same performance as two-stage detectors such as Faster R-CNN. The significant contribution of this paper is a new classification loss function, called the focal loss, which has significantly increased accuracy. RetinaNet is a single, composite network consisting of a leading backbone network, a Feature Pyramid Network built on ResNet (ResNet50 or ResNet101), and two task-specific sub-networks. The backbone network calculates the convolutional feature map for the entire input image. The first subnetwork classifies the backbone's output, while the second subnetwork performs bounding box regression using the backbone's output. Because of its fully convolutional structure, RetinaNet can take an image of arbitrary size and generates feature maps of proportional sizes at several levels of the feature pyramid. In the classification sub-network, a fully convolutional network is attached to each level of the FPN; for each of the A anchors and K object classes, it predicts how probable it is that an object is present at each spatial position. There are four 3 × 3 convolutional layers with 256 filters, each followed by ReLU activation [117]; a further 3 × 3 convolutional layer with K × A filters is applied, followed by sigmoid activations at the outputs. The focal loss is applied as the loss function. For this subnetwork, parameters are shared across all levels; as a result, the output feature map has dimensions (W, H, KA), where W and H are the feature map width and height and K and A denote the object classes and anchor boxes. The regression subnetwork is attached to each FPN feature map in parallel with the classification subnetwork and is designed in the same way; the only differences are that the parameters are not shared and that the last convolutional layer is a 3 × 3 layer with 4A filters, so the output feature map has shape (W, H, 4A). RetinaNet utilizes the focal loss function to address class imbalance during training: the focal loss down-weights the loss assigned to well-classified examples, focusing training on a sparse set of hard examples and preventing the many easy negatives from overwhelming the detector during training.
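The following is a minimal sketch of the binary focal loss described above, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The values alpha = 0.25 and gamma = 2 are the commonly quoted defaults and are used here only for illustration.

```python
# A minimal sketch of the binary focal loss: the (1 - p_t)**gamma factor
# down-weights well-classified examples so training focuses on hard ones.
import numpy as np


def focal_loss(pred_prob, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """pred_prob: predicted object probabilities in (0, 1); target: 0/1 labels."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


# An easy, well-classified negative contributes far less than a hard one:
print(focal_loss(np.array([0.05]), np.array([0])))  # tiny loss
print(focal_loss(np.array([0.9]),  np.array([0])))  # much larger loss
```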
5) MEGDET in four hours. The MegDet paper does not describe a specific
MegDet [118] is a model that tackles the object detec- loss function by name. However, it mentions that the shape of
tion task from the batch size factor. The authors propose a the regression loss function (parameters of SmoothL1 Loss)
Large Mini-Batch size of 256 instead of 16 during training. is used in MegDet.
6) EFFICIENTDET
EfficientDet [67] is an object detection model that relies on pretrained EfficientNet [60] backbones, a weighted bidirectional feature network (BiFPN), and a customized compound scaling technique. The bidirectional feature network takes the level-3 to level-7 features from the EfficientNet backbone and applies top-down and bottom-up bidirectional feature fusion. The class and box network weights are shared across all feature levels. EfficientDet-D7 achieved an AP of 52.2% on the MS-COCO dataset using EfficientNet-B7 as a backbone. EfficientDet uses a focal loss function for dense object detection; the paper also mentions that the detection head and loss function can be replaced with a segmentation head and loss to perform segmentation tasks.
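The weighted bidirectional fusion can be illustrated with the "fast normalized fusion" formulation reported for BiFPN-style networks, where each input feature map gets a learnable non-negative weight; the module below is a simplified sketch, not the official EfficientDet implementation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized feature fusion: out = sum_i(w_i * x_i) / (sum_i(w_i) + eps)."""

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)          # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)          # normalize so they sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

# usage: fuse two feature maps of the same shape
fuse = WeightedFusion(num_inputs=2)
p4_td = fuse([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
```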
the entire model.
7) PAA
PAA [119], a model based on a new technique to D. COMPARISON: ONE-STAGE DETECTORS
assign anchors based on the likelihood optimization of Table 5 lists a chronological comparison of the strengths and
the probability distribution, stands for probabilistic anchor limitations of the one-stage anchor-based detection methods
assignment. It consists of computing scores of anchors and mentioned earlier in this paper.
identifying positive and negative samples in a probabilistic
way compared to the heuristic IoU challenging assignment,
which makes the training process more difficult and time- VII. ANCHOR-FREE DETECTORS
consuming. The authors propose a score voting method for A. YOLOv1
post-processing in dense object detection. PAA achieved an YOLO [72] has a different approach to object detection.
AP of 50.8% on the MS-COCO dataset and 53.5% on the It captures the complete image in a single instance. It then pre-
same dataset for multi-scale testing. The authors tested the dicts both the coordinates of the bounding boxes for regres-
model with many backbones and obtained the best results sion and the class probabilities with only one network in one
using ResNeXt-32 × 8d-152-DCN. evaluation. Thus, his name is YOLO; you only look once.
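To make the probabilistic assignment idea concrete, the sketch below fits a two-component Gaussian mixture to per-anchor quality scores for a single ground-truth box and keeps the anchors claimed by the higher-scoring component. The scoring input and the decision rule are placeholders for illustration, not PAA's exact formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def assign_anchors(anchor_scores):
    """Split anchors into positives/negatives by modelling their scores
    with a two-component Gaussian mixture."""
    scores = np.asarray(anchor_scores).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
    high = np.argmax(gmm.means_.ravel())      # component with the larger mean = positives
    return gmm.predict(scores) == high        # boolean mask over the anchors

positives = assign_anchors([0.05, 0.10, 0.12, 0.70, 0.80, 0.75, 0.09])
```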
8) YOLOv5
YOLOv5,1 a model in the You Only Look Once (YOLO) family, is used for detecting objects and comes in four main versions: small (s), medium (m), large (l), and extra large (x), each offering progressively higher accuracy. YOLOv5 focuses on inference speed and accuracy, using compound-scaled object detection models trained on the COCO dataset, with support for model ensembling and Test-Time Augmentation. The algorithm only looks at an image once and detects all objects present and their locations. YOLOv5 was introduced in 2020 by Ultralytics as an open-source project. It builds upon the success of previous versions and adds several new features and improvements. YOLOv5 uses a convolutional neural network (CNN) backbone called CSPDarknet to form image features. These features are combined in the model neck, which uses a PANet (Path Aggregation Network) variant, and sent to the head. The model head then interprets the combined features to predict object classes and bounding boxes. It also uses residual and dense blocks to enable information flow to the deepest layers. The architecture consists of three parts: backbone, neck, and head.
1 https://github.com/ultralytics/yolov5
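As a quick illustration, the ultralytics/yolov5 release can be loaded for inference through PyTorch Hub, following the usage published in the repository's documentation; entry-point names and result attributes may change between releases, so treat this as a sketch rather than part of the surveyed paper.

```python
import torch

# Load a small pretrained YOLOv5 model from the ultralytics/yolov5 repository
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("zidane.jpg")       # run inference on an image path or URL
results.print()                     # class, confidence, and box for each detection
boxes = results.xyxy[0]             # tensor of [x1, y1, x2, y2, confidence, class]
```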
9) YOLOv7
YOLOv7 [233] is a faster and more accurate real-time object detector for computer vision tasks. Like Scaled-YOLOv4 [199], YOLOv7 does not use ImageNet pre-trained backbones: its weights are trained on Microsoft's COCO dataset, and no other datasets or pre-trained weights are used. The official paper demonstrates how this improved architecture surpasses all previous YOLO versions and all other object detection models in terms of speed and accuracy. YOLOv7 improves speed and accuracy by introducing several architectural reforms. The larger models in the YOLOv7 family, YOLOv7-X, YOLOv7-E6, YOLOv7-D6, and YOLOv7-E6E, were obtained by applying the proposed compound scaling method to scale up the depth and width of the entire model.

D. COMPARISON: ONE-STAGE DETECTORS
Table 5 lists a chronological comparison of the strengths and limitations of the one-stage anchor-based detection methods mentioned earlier in this paper.

VII. ANCHOR-FREE DETECTORS
A. YOLOv1
YOLO [72] takes a different approach to object detection. It captures the complete image in a single pass and then predicts both the bounding-box coordinates (regression) and the class probabilities with a single network in one evaluation, hence the name YOLO: You Only Look Once. This design is what allows the YOLO model to make real-time predictions. The input image is split into an S×S grid of cells to perform detection, and a grid cell is responsible for predicting an object when the object's center falls inside that cell. Each cell predicts B candidate bounding boxes, together with C class probability values, for a total of S×S×B boxes. Since the probability of most of these boxes is relatively small, the algorithm discards the boxes that fall below a minimum probability threshold. A non-maximal suppression procedure is then applied to the remaining boxes, removing multiple detections of the same object and keeping the most accurate ones. A CNN based on the GoogLeNet [120] model, which includes inception modules, is used. The network architecture includes 24 convolutional layers and two fully connected layers; reduction layers of 1 × 1 filters followed by 3 × 3 convolutional layers replace the original inception modules. The final layer outputs a tensor of size S × S × (C + B × 5), which holds the predictions of each grid cell: C is the number of class probabilities, and B is the number of boxes predicted per cell, each with four coordinates and a confidence value. YOLO has three loss terms, one for the objectness score and two others for the coordinate and classification errors; the latter is computed only when the objectness score is greater than 0.5. The YOLOv1 loss function is thus divided into three parts: one responsible for the bounding-box coordinates, one for the bounding-box confidence score, and one for the class prediction. The final loss is the sum of these three parts.
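To make the S × S × (C + B × 5) output layout concrete, the sketch below decodes such a tensor into per-cell boxes and class scores. S, B, and C are set to YOLOv1's usual values (7, 2, 20) purely for illustration, and the conversion of box coordinates to absolute image positions is omitted.

```python
import torch

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes (YOLOv1 defaults)
pred = torch.rand(S, S, C + B * 5)          # stand-in for a network output

class_probs = pred[..., :C]                 # (S, S, C): conditional class probabilities per cell
boxes = pred[..., C:].reshape(S, S, B, 5)   # (S, S, B, 5): x, y, w, h, confidence

# class-specific confidence for every box: P(class | object) * box confidence
scores = class_probs.unsqueeze(2) * boxes[..., 4:5]   # (S, S, B, C)
keep = scores.max(dim=-1).values > 0.2                # discard low-probability boxes
```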
B. CORNERNET
CornerNet [74] is an object detection model that uses keypoints to detect the object bounding box. It uses a convolutional neural network to detect objects as paired keypoints, the top-left and bottom-right corners. Those corners are represented as heatmaps, one for the top-left corners and the other for the bottom-right corners. Each corner has only one ground-truth positive location, while all the remaining locations are treated as negatives. This technique frees the model from the traditional anchors employed in other object detectors. The authors also propose a new type of pooling layer, named corner pooling, that aims to localize corners efficiently. CornerNet uses the Hourglass-104 backbone and achieved an accuracy of 40.5% on the MS-COCO dataset and 42.1% with multi-scale testing on the same dataset. CornerNet uses associative embedding, where the network predicts similar embeddings for corners belonging to the same object, with a loss function similar to the triplet loss. In addition, it proposes a new variant of the focal loss, which dynamically adjusts the weight given to each location.
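A compact sketch of the top-left corner pooling idea: for every location it takes the running maximum toward the right along rows in one feature map and toward the bottom along columns in another, then sums the two. This mirrors the operation described for CornerNet, although the official implementation uses a dedicated CUDA kernel.

```python
import torch

def top_left_corner_pool(f_horizontal, f_vertical):
    """Corner pooling for top-left corners on (N, C, H, W) feature maps:
    max over everything to the right in f_horizontal plus max over
    everything below in f_vertical, for each spatial position."""
    right_max = torch.flip(
        torch.cummax(torch.flip(f_horizontal, dims=[3]), dim=3).values, dims=[3])
    bottom_max = torch.flip(
        torch.cummax(torch.flip(f_vertical, dims=[2]), dim=2).values, dims=[2])
    return right_max + bottom_max

pooled = top_left_corner_pool(torch.rand(1, 8, 16, 16), torch.rand(1, 8, 16, 16))
```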
C. EXTREMENET
ExtremeNet [121] uses a bottom-up approach to detect objects. It uses a standard keypoint estimation network to identify the object's center point and its four extreme points: top-most, right-most, left-most, and bottom-most. These four extreme points are used to form the object bounding box in a purely geometric way. The model uses Hourglass-104 as a backbone and obtained an accuracy of 40.2% and 43.7% on the MS-COCO dataset for single-scale and multi-scale testing, respectively. The ExtremeNet paper does not describe a specific loss function by name.

D. REPPOINTS
RepPoints [78] stands for representative points, a technique that represents objects as a set of sample points. Since traditional bounding boxes provide only coarse localization and feature extraction, RepPoints uses points to localize and identify objects. The RepPoints technique does not use anchors to sample the space of bounding boxes. Instead, it learns to arrange the points automatically from the ground-truth localization and recognition targets, limiting the spatial extent of an object and identifying its semantically relevant local areas. The authors' proposed object detection model, RPDet [78], is based on the RepPoints representation combined with deformable convolution. RPDet used ResNet-101-DCN as a backbone and obtained accuracies of 42.8% and 46.5% with multi-scale training and testing. The RepPoints paper describes two sets of RepPoints, one driven by the points distance loss alone and the other by a combination of the points distance loss and the center-ness loss.

E. FSAF
The authors propose a Feature Selective Anchor-Free (FSAF) module [122] to solve two problems faced by anchor-based single-shot detectors with feature pyramids: heuristic-guided feature selection and overlap-based anchor sampling. While training multi-level anchor-free branches, the FSAF module applies online feature selection.

F. FCOS
In addition to being an anchor-free detector, the Fully Convolutional One-Stage detector (FCOS) [75] is also proposal-free.

I. DSLA
DSLA [126] was evaluated with several backbones, such as ResNet-50, ResNet-50-DCN, ResNeXt-101-64×4d-DCN, and Swin-S. With the Swin-S backbone, it achieved remarkable results on the MS-COCO dataset, reaching 49.2%. DSLA improves the performance of detection models through an adaptive label assignment algorithm and lower bounding-box losses for positive samples, meaning that more samples with higher-quality predicted boxes are selected as positives.
J. YOLOv8
YOLOv8,2 a state-of-the-art object detection, image classification, and instance segmentation model developed by Ultralytics, is designed to be fast, accurate, and easy to use. YOLOv8 builds upon the success of previous YOLO versions and introduces new features and improvements to further boost performance and flexibility. It can be trained on large datasets and run on various hardware platforms, from CPUs to GPUs. One key feature of YOLOv8 is its extensibility: it supports all previous versions of YOLO, making it easy to switch between versions and compare their performance. This makes YOLOv8 an appealing choice for users who want to take advantage of the latest YOLO technology while still being able to use their existing YOLO models. YOLOv8 includes numerous architectural and developer-convenience features, making it suitable for a wide range of object detection and image segmentation tasks. The architecture of YOLOv8 evolved from a simple version to a more complex one, with new convolutional layers and a new detection head; compared to YOLOv5, it replaces the C3 module with the C2f module.
2 https://docs.ultralytics.com/
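For completeness, the snippet below shows the kind of high-level API exposed by the ultralytics package for YOLOv8; the model and file names follow the official documentation, but treat it as an illustrative sketch rather than part of the surveyed paper.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                    # small pretrained detection model
results = model("bus.jpg")                    # run inference on one image

for box in results[0].boxes:                  # iterate over detected boxes
    print(box.xyxy, box.conf, box.cls)        # coordinates, confidence, class id

# model.train(data="coco128.yaml", epochs=3)  # fine-tuning entry point (optional)
```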
K. COMPARISON: ANCHOR-FREE DETECTORS
Table 6 lists a chronological comparison of the strengths and limitations of the anchor-free object detection methods mentioned earlier in this paper.

VIII. TRANSFORMER-BASED DETECTORS
A. VIT
ViT, published in [127] and inspired by transformers in NLP tasks [128], [129], was the first model to apply transformers directly to images instead of combining convolutional neural networks and transformers. ViT splits the image into patches and provides the sequence of linear embeddings of these patches as input to a transformer. The model processes the patches the way word tokens are processed in natural language processing. A constant latent vector size is used through all the transformer layers, so the patches are flattened and mapped to that dimension with a trainable linear projection. The classification head is an MLP with one hidden layer at pre-training time and a single linear layer at fine-tuning time. ViT achieved its highest performance when trained on larger datasets at the time it was first published. The Vision Transformer (ViT) paper does not describe a specific loss function by name. However, the ViT model outputs hidden raw states without any specific head on top, and it can be used as a building block for various computer vision tasks such as image classification.
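A minimal sketch of the patch-embedding step described above, using a strided convolution to split the image into 16 × 16 patches and project them to the latent dimension; the sizes are illustrative defaults, not a faithful reproduction of any particular ViT variant.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project them."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.num_patches = (img_size // patch_size) ** 2

    def forward(self, x):                        # x: (N, 3, H, W)
        x = self.proj(x)                         # (N, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (N, num_patches, dim) token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))   # -> (1, 196, 768)
```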
B. DETR
The DEtection TRansformer (DETR), presented in [130], is the first end-to-end object detection model based on transformers. It consists of a pretrained CNN backbone and a transformer. The model uses ResNets as a backbone to generate lower-dimensional features, which are then flattened into a single set of features, added to a positional encoding, and fed into a transformer. The transformer makes the detector trainable end to end. It is based on the original transformer [131] and consists of an encoder and a decoder, removing hand-crafted modules such as anchor generation. The transformer encoder takes image features and position encodings as input and passes the result to the decoder. The decoder processes those features and transmits the output to a fixed number of prediction heads, which consist of a predefined number of feed-forward networks; each prediction head outputs a class and a bounding box. Multi-head attention in the decoder refines the object queries using the encoder embeddings, and the results are passed through multi-layer perceptrons to predict classes and bounding boxes. DETR uses a bipartite matching loss to find the optimal one-to-one matching between the detector outputs and the padded ground truth. DETR generates a predefined number of predictions, each computed in parallel, and proposes a set-based global loss that forces unique predictions via bipartite matching. The DETR model thus approaches object detection as a direct set-prediction problem, with a set-based global loss that is the sum of the classification loss and the bounding-box regression loss.
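The one-to-one matching mentioned above is usually solved with the Hungarian algorithm. The sketch below builds a toy cost matrix from class probabilities and an L1 box distance and matches predictions to ground truth with SciPy; the cost weights are illustrative, not DETR's exact ones.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_box=5.0):
    """Return (pred_idx, gt_idx) pairs minimizing a classification + L1 box cost."""
    cls_cost = -pred_probs[:, gt_labels]                            # higher prob -> lower cost
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cls_cost + w_box * box_cost                      # (num_preds, num_gt)
    return linear_sum_assignment(cost)                              # Hungarian algorithm

pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])           # 2 queries, 3 classes
pred_boxes = np.array([[0.2, 0.2, 0.4, 0.4], [0.6, 0.6, 0.8, 0.8]])
rows, cols = match(pred_probs, pred_boxes, np.array([1, 0]),
                   np.array([[0.58, 0.60, 0.82, 0.80], [0.21, 0.19, 0.40, 0.41]]))
```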
C. SMCA
The SMCA model [132], published in 2021, was proposed as an alternative way to improve the convergence of DETR, which needs about 500 training epochs from scratch to achieve its best results. SMCA introduces a mechanism called Spatially Modulated Co-Attention: it only replaces the co-attention mechanism in the DETR decoder with a location-aware co-attention that constrains the co-attention responses to be high near the initially estimated bounding-box locations. Training SMCA takes only 108 epochs, and it achieves better results than the original DETR while retaining the ability to process global information.
D. SWIN
The Swin Transformer [133] seeks to provide a transformer-based backbone for computer vision tasks. The name Swin stands for Shifted WINdow, and it was the first work to apply the shifted-window concept to transformers. It uses patches as in the ViT model, splitting the input images into multiple non-overlapping patches and converting them into embeddings. Numerous Swin Transformer blocks are then applied to the patches in four stages. Each successive stage reduces the number of patches to build a hierarchical representation, in contrast to ViT, which uses patches of a single size. These patches are converted linearly into C-dimensional vectors. Self-attention is computed only within each local window, as the transformer block comprises local multi-headed self-attention modules based on alternately shifted patch windows in successive blocks. Computational complexity becomes linear in the image size thanks to local self-attention, while the shifted windows enable cross-window connections and keep the complexity low. In each successive layer, the attention windows are shifted with respect to the previous layer. Swin uses comparatively more parameters than convolutional models.
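A small sketch of the windowing mechanics described above: partitioning a feature map into fixed windows for local self-attention, and cyclically shifting it (with torch.roll) so that the next block's windows straddle the previous ones. The window size and feature shape are illustrative.

```python
import torch

def window_partition(x, window_size=7):
    """Split (N, H, W, C) features into (num_windows*N, window_size*window_size, C)."""
    n, h, w, c = x.shape
    x = x.view(n, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)

x = torch.randn(1, 56, 56, 96)                 # stage-1 sized feature map
windows = window_partition(x)                  # local windows for self-attention

# shifted-window variant used in the next block: cyclic shift by half a window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted)
```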
E. ANCHOR DETR
Anchor DETR addresses the absence of explicit physical meaning in DETR's learned object queries, which makes the optimization process difficult. Anchor points were used before in CNN-based detectors, and applying this mechanism lets each object query focus on the objects near its anchor point. The Anchor DETR model can also predict multiple objects at one position. To optimize the complexity, the authors use an attention variant, Row-Column Decoupled Attention, that reduces the memory cost without sacrificing accuracy. The primary model uses ResNet-101 as the backbone with a DC5 feature and achieves an accuracy of 45.1% on MS-COCO with considerably fewer training epochs than DETR. The authors also proposed anchor-free, RAM-free, and NMS-free variants.

F. DESTR
The authors propose a Detection Split Transformer (DESTR) that divides the content-embedding estimation of cross-attention into two independent parts, one for classification and the other for the box-regression embedding. By doing this, they let each cross-attention branch deal with its specific task. For the content-query initialization, they use a mini-detector to learn the content and initialize the positional embedding of the decoder; it is equipped with heads for the classification and regression embeddings. Finally, to account for pairs of adjacent object queries in the decoder, they augment the self-attention with the spatial context of the other query in the pair.

G. COMPARISON: TRANSFORMER-BASED DETECTORS
Table 7 lists a chronological comparison of the strengths and limitations of the transformer-based detection methods mentioned earlier in this paper.
IX. PERFORMANCE ANALYSIS AND DISCUSSION
This section tests and compares the object detection models on the three benchmark databases of the object detection field: Pascal VOC 2007, Pascal VOC 2012, and MS-COCO. The column "data" in the following tables refers to the training data.

A. PASCAL VOC 2007
The results of the tests are listed in Table 8.

B. PASCAL VOC 2012
The results of the tests are listed in Table 9.

C. MS-COCO
The results of the tests are listed in Table 10.

D. TESTING CONSUMPTION
All the frameworks listed below are tested using the Nvidia Titan X GPU (Maxwell architecture) for all experiments, facilitating speed comparison with earlier experiments, as they used the same GPU.

1) PASCAL VOC07
The results of the tests are presented in Table 11.

2) MS-COCO
The results of the tests are presented in Table 12.

E. DISCUSSION
As we can observe in this survey, most of the tests were performed on the MS-COCO database. Indeed, its large size and rich annotations allow us to evaluate the models on a wide range of images and give a clear picture of how the models generalize. Table 8 shows that all the models that achieved the highest mAP on Pascal VOC 2007 fall into the anchor-based detector family. All the leading five models belong to the two-stage approach, except for ScratchDet++, which follows the one-stage approach. Copy-Paste achieved an mAP of 88.6% by combining EfficientNet-B7 and NAS-FPN as a backbone; moreover, it reached an mAP of 89.3% when pre-training on MS-COCO, highlighting the importance of copy-and-paste data augmentation. SNIPER, ScratchDet, ACoupleNet, and Faster R-CNN achieved the following mAPs: 86.9%, 86.3%, 85.7%, and 85.6%. Except for the Copy-Paste model, which uses EfficientNet-B7 NAS-FPN as its backbone, all other leading models use one of the following networks: ResNets, Root-ResNets, or VGGNets, proving the strong performance of these backbones.
From Table 9, we remark that anchor-based detection methods are the models that scored the best mAPs on Pascal VOC 2012. We also notice that the one-stage anchor-based detectors surpass the two-stage anchor-based detectors, which was the opposite in the past. RefineDet512++ achieved the best mAP of 86.8% with pretraining on the MS-COCO dataset using VGGNet-16. In contrast, the highest mAP without pretraining on MS-COCO belongs to RetinaNet500 using AP-Loss and ResNet-101 as the backbone, with an mAP of 84.5% when applying multi-scale testing. ScratchDet300+, FSSD512, and BlitzNet obtained mAPs of 86.3%, 84.2%, and 83.8%, respectively. Similar to the Pascal VOC 2007 results, the main backbones that achieved the best results were VGG networks, residual networks, and Root-ResNets.
For the models tested on the MS-COCO dataset, we can notice the intense competition between the different approaches: the first four positions belong to different object detection families. So far, the Swin V2-G model, based on transformers and the HTC++ backbone, is the winner, with an mAP of 63.1%. Ranking second, we find Copy-Paste, which belongs to the anchor-based family, with an mAP of 56.0%; Copy-Paste uses a combination of Cascade Eff-B7 and NAS-FPN. In third place, we find YOLOv4-P7, which falls into the anchor-free detector family, with an mAP of 55.5%; YOLOv4-P7 uses the CSP-P7 network as its backbone. In fourth place, we have the EfficientDet-D7x model, which achieved an mAP of 55.1% and used the EfficientNet-B7 network as its backbone; EfficientDet-D7x belongs to the one-stage anchor-based object detector family. In MS-COCO, the backbones that assisted in achieving an mAP greater than 50.0% are ResNets, ResNeXts, EfficientNets, SpineNet, CSP, and HTC++.
Table 11 shows that all the fast object detection algorithms belong to the one-stage anchor-based family when implementing object detection models in a real-time environment. However, achieving high accuracy at many frames per second is difficult, as in the case of Fast YOLO, which achieved 155 FPS while obtaining only 55.7% mAP.
We can spot, for example, that a model like EFIPNet managed to strike a balance: EFIPNet achieved an mAP of 80.4% at an impressive 111 FPS using VGGNet-16 as its backbone, and RefineDet320 achieved an mAP of 80.0% at 40 FPS using VGGNet as a backbone.
According to Table 12, we can observe that all the fast object detection models belong to the anchor-based single-stage family. In addition, we can see that some models have successfully balanced detection accuracy and runtime speed. For example, YOLOv4, which uses CSPDarknet-53, achieved an mAP of 41.2% at 54 FPS, and EfficientDet-D2, which uses the Efficient-B2 backbone, achieved an mAP of 43.0% at 41.7 FPS. Furthermore, no two-stage object detector model has performed well in real time (FPS > 30): RDSNet reaches 17 FPS at an mAP of 36.0%. In comparison, anchor-free detectors such as CornerNet and ATSS could only attain 4.4 and 7 FPS, respectively. Therefore, we conclude that the anchor-based one-stage detectors are still the fastest.
Figure 5 shows the evolution of the accuracy on the three datasets, VOC07, VOC12, and MS-COCO, between 2013 and 2022. The figure also displays the winning detection model for each year within each dataset. For VOC07 and VOC12, the accuracy is reported as mAP, while for MS-COCO it is mAP@[.5,.95]. The chart shows that the accuracy on VOC07 evolved from 58.5% in 2013, with the R-CNN BB model, to 89.3% in 2021 with the Copy-Paste model, an increase of more than 30%. The same holds for VOC12, with an increase in accuracy of over 33% during the same period. On MS-COCO, there was an improvement in accuracy of 40% between 2015, with a value of 23.6 through ION, and 63.1 in 2022 through the SwinV2-G model. We also note that accuracy improves every year on the MS-COCO dataset.
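Since the tables report both mAP and COCO-style mAP@[.5,.95], the sketch below recalls how a single prediction contributes under the COCO protocol: compute its IoU with the ground truth, then check it against each of the ten IoU thresholds. It is a deliberately simplified stand-in for the full COCO evaluation, shown only to make the metric concrete.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

thresholds = np.arange(0.5, 1.0, 0.05)            # the ten COCO IoU thresholds
score = iou([10, 10, 50, 50], [12, 12, 48, 52])
hits = (score >= thresholds).mean()               # fraction of thresholds where the box is a true positive
```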
In contrast, on VOC12 the accuracy has stayed the same since 2017, remaining at the value of 86.8% achieved by RefineDet. Likewise, on VOC07 the accuracy has only increased by 2.4% since 2018, with the introduction of Copy-Paste.
Figure 6 shows the evolution of the different types of object detection models on the MS-COCO dataset between 2015 and 2022. It can be seen that anchor-based two-stage models were the first to be evaluated on MS-COCO in 2015, followed by anchor-based one-stage models in 2016, anchor-free models in 2017, and transformer-based models in 2020. So far, the most successful family is the transformer-based one with SwinV2-G, followed by the anchor-based two-stage family with SoftTeacher, then the anchor-based one-stage family with DyHead, and finally the anchor-free one-stage detectors with YOLOv4-P7. We note a difference of more than 7% between the best transformer-based detector, SwinV2-G, and the best anchor-free detector, YOLOv4-P7. The anchor-based two-stage family improved by 26%, starting with ION at an accuracy of 33.1 in 2015 and reaching 59.1 in 2021 with the SoftTeacher model. For the anchor-based one-stage detectors, SSD achieved an accuracy of 28.8% in 2016, and by 2021 DyHead had improved on this by roughly 30 points. DetNet101, a model of the anchor-free detector family, reached an accuracy of 33.8% in 2017, and in 2021 YOLOv4-P7 increased the accuracy by more than 21%, reaching 55.5%. The most recently published transformer-based detectors achieved the best results, with SwinV2-G reaching an accuracy of 63.1% in 2022, while the first pure transformer-based model, DETR, achieved only 44.9% in 2020.
Figure 7 illustrates the number of detection models evaluated on MS-COCO by each detector family between 2015 and 2022. We find that 2018 was the most productive year, with
more than 30 published models, of which half were anchor-based two-stage models and the other half anchor-based one-stage methods, with only one anchor-free model published. We also notice that anchor-based two-stage methods dominated the literature between 2015 and 2018, with more than 36 published models, whereas between 2018 and 2020 more than 36 anchor-based one-stage models were published. One can also see that anchor-based models grew steadily from 2015 to 2018; after 2018, they started losing ground to the other detection families, such as anchor-free and transformer-based detectors. For example, more than 15 different models of the anchor-based two-stage family were introduced in 2018, while just one year later only five were released. In 2020, only two such models were released, while more than six anchor-free detectors appeared in the same year. As soon as they appeared in 2020, the transformer-based detectors expanded continuously.
FIGURE 8. The percentage of object detection models published each year.
Figure 8 shows that about half of the detection models based on deep learning and evaluated on the MS-COCO dataset were introduced between 2018 and 2019. After 2019, the number of published models decreased yearly, with a share of 14% in 2020, 11.6% in 2021, and 3.3% in 2022.
X. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we presented an overview of the current state of object detection based on deep learning. We have provided the most detailed survey covering dozens of object detection models. We divided the models into four main approaches: two-stage anchor-based detectors, one-stage anchor-based detectors, anchor-free detectors, and transformer-based detectors. We tested and evaluated all models on the major object detection databases, such as Pascal VOC and MS-COCO. We determined that single-stage detectors have improved and now rival two-stage detectors' accuracy. Furthermore, with the emergence of transformers in vision tasks, transformer-based detectors have achieved peak results, such as Swin-L and Swin V2, which achieved an mAP of 57.7% and 63.1%, respectively, on the MS-COCO dataset.
Object detection is an active area of research that is constantly evolving, and there are several promising future directions that researchers are exploring.
1) Speed-accuracy trade-off: Increasing the accuracy of an object detection algorithm requires more computational resources and longer processing times, while decreasing the accuracy can lead to faster processing but lower detection performance. Therefore, researchers consistently aim to improve both the accuracy and the speed of object detection algorithms by using more efficient architectures and training methods, to enable real-time and low-power applications, especially in complex scenes with occlusions or cluttered backgrounds.
2) Tiny object detection: Tiny object detection is a specific case of object detection focusing on detecting and localizing very small objects in images or videos. It remains challenging because extracting information from small objects with only a few pixels is difficult. These objects may be so small that they are barely visible or partially occluded by other objects in the scene. Tiny object detection has many potential applications, such as detecting small animals in wildlife monitoring, identifying minor defects in manufacturing processes, and medical imaging.
3) 3D object detection: With the increasing availability of 3D sensors, there is a growing interest in 3D object detection. Unlike 2D object detection, which estimates the location and size of objects in a two-dimensional image, 3D object detection involves estimating objects' position, orientation, and dimensions in three-dimensional space. 3D object detection can be helpful in applications such as augmented reality, robotics, and autonomous driving, where accurate knowledge of the 3D environment is necessary for navigation and interaction with the physical world.
4) Multi-modal object detection: This involves detecting objects from multiple visual and textual sources, such as images, videos, and audio, enabling more comprehensive and accurate object detection in complex scenarios. Multi-modal detection can be helpful in applications such as autonomous driving, where multiple sensors detect objects around a vehicle.
5) Few-shot learning: Few-shot learning is an area of research that aims to develop algorithms that learn to detect objects from just a few examples. This is particularly useful when collecting large amounts of labeled data is difficult or expensive. Such models will work with limited data or in low-resource settings.
Overall, the future of object detection using deep learning is promising, with many exciting developments for future research.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015, doi: 10.1038/nature14539.
[2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010, doi: 10.1109/TPAMI.2009.167.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2012, pp. 1097–1105. Accessed: Oct. 22, 2019. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 2722–2730, doi: 10.1109/ICCV.2015.312.
[5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," 2016, arXiv:1611.07759. Accessed: Oct. 21, 2019.
[6] S. Ramos, S. Gehrig, P. Pinggera, U. Franke, and C. Rother, "Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling," 2016, arXiv:1612.06573. Accessed: Oct. 21, 2019.
[7] J. Ni, K. Shen, Y. Chen, W. Cao, and S. X. Yang, "An improved deep network-based scene classification method for self-driving cars," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–14, 2022, doi: 10.1109/TIM.2022.3146923.
[8] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang, "Abusive language detection in online user content," in Proc. 25th Int. Conf. World Wide Web, Montreal, QC, Canada, Apr. 2016, pp. 145–153, doi: 10.1145/2872427.2883062.
[9] A.-M. Founta, D. Chatzakou, N. Kourtellis, J. Blackburn, A. Vakali, and I. Leontiadis, "A unified deep learning architecture for abuse detection," 2018, arXiv:1802.00385. Accessed: Oct. 21, 2019.
[10] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 3730–3738, doi: 10.1109/ICCV.2015.425.
[11] W. Liu, I. Hasan, and S. Liao, "Center and scale prediction: Anchor-free approach for pedestrian and face detection," Pattern Recognit., vol. 135, Mar. 2023, Art. no. 109071, doi: 10.1016/j.patcog.2022.109071.
[12] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in Human Behavior Understanding, A. A. Salah and B. Lepri, Eds. Berlin, Germany: Springer, 2011, pp. 29–39, doi: 10.1007/978-3-642-25446-8_4.
[13] Z. Sun, Q. Ke, H. Rahmani, M. Bennamoun, G. Wang, and J. Liu, "Human action recognition from various data modalities: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3, pp. 3200–3225, Mar. 2022, doi: 10.1109/TPAMI.2022.3183112.
[14] A. A. Cruz-Roa, J. E. A. Ovalle, A. Madabhushi, and F. A. G. Osorio, "A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection," in Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013, C. Salinesi, M. C. Norrie, and Ó. Pastor, Eds. Berlin, Germany: Springer, 2013, pp. 403–410, doi: 10.1007/978-3-642-40763-5_50.
[15] A. B. Nassif, M. A. Talib, Q. Nasir, Y. Afadar, and O. Elgendy, "Breast cancer detection using artificial intelligence techniques: A systematic literature review," Artif. Intell. Med., vol. 127, May 2022, Art. no. 102276, doi: 10.1016/j.artmed.2022.102276.
[16] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," Int. J. Robot. Res., vol. 34, nos. 4–5, pp. 705–724, Apr. 2015, doi: 10.1177/0278364914549607.
[17] Z. Zhou, L. Li, A. Fürsterling, H. J. Durocher, J. Mouridsen, and X. Zhang, "Learning-based object detection and localization for a mobile robot manipulator in SME production," Robot. Comput.-Integr. Manuf., vol. 73, Feb. 2022, Art. no. 102229, doi: 10.1016/j.rcim.2021.102229.
[18] W. Wang, X. Wu, X. Yuan, and Z. Gao, "An experiment-based review of low-light image enhancement methods," IEEE Access, vol. 8, pp. 87884–87917, 2020, doi: 10.1109/ACCESS.2020.2992749.
[19] G. Guo, H. Wang, C. Shen, Y. Yan, and H.-Y.-M. Liao, "Automatic image cropping for visual aesthetic enhancement using deep neural networks and cascaded regression," IEEE Trans. Multimedia, vol. 20, no. 8, pp. 2073–2085, Aug. 2018, doi: 10.1109/TMM.2018.2794262.
[20] A. B. Amjoud and M. Amrouch, "Transfer learning for automatic image orientation detection using deep learning and logistic regression," IEEE Access, vol. 10, pp. 128543–128553, 2022, doi: 10.1109/ACCESS.2022.3225455.
[21] K. Aurangzeb, S. Aslam, M. Alhussein, R. A. Naqvi, M. Arsalan, and S. I. Haider, "Contrast enhancement of fundus images by employing modified PSO for improving the performance of deep learning models," IEEE Access, vol. 9, pp. 47930–47945, 2021, doi: 10.1109/ACCESS.2021.3068477.
[22] W. Zhiqiang and L. Jun, "A review of object detection based on convolutional neural network," in Proc. 36th Chin. Control Conf. (CCC), Jul. 2017, pp. 11104–11109, doi: 10.23919/ChiCC.2017.8029130.
[23] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 3296–3297, doi: 10.1109/CVPR.2017.351.
[24] S. Agarwal, J. O. D. Terrail, and F. Jurie, "Recent advances in object detection in the age of deep convolutional neural networks," 2018, arXiv:1809.03193.
[25] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019, doi: 10.1109/TNNLS.2018.2876865.
[26] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, "Salient object detection: A survey," Comput. Vis. Media, vol. 5, no. 2, pp. 117–150, Jun. 2019, doi: 10.1007/s41095-019-0149-9.
[27] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, "Object detection in 20 years: A survey," 2019, arXiv:1905.05055.
[28] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," Int. J. Comput. Vis., vol. 128, pp. 261–318, 2020, doi: 10.1007/s11263-019-01247-4.
[29] G. Huang, I. Laradji, D. Vazquez, S. Lacoste-Julien, and P. Rodriguez, "A survey of self-supervised and few-shot object detection," 2021, arXiv:2110.14711.
[30] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang, "Salient object detection in the deep learning era: An in-depth survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 3239–3259, Jun. 2022, doi: 10.1109/TPAMI.2021.3051099.
[31] K. Tong and Y. Wu, "Deep learning-based detection from the perspective of small or tiny objects: A survey," Image Vis. Comput., vol. 123, Jul. 2022, Art. no. 104471, doi: 10.1016/j.imavis.2022.104471.
[32] X. Wu, D. Sahoo, and S. C. H. Hoi, "Recent advances in deep learning for object detection," Neurocomputing, vol. 396, pp. 39–64, Jul. 2020, doi: 10.1016/j.neucom.2020.01.085.
[33] N.-D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, "An evaluation of deep learning methods for small object detection," J. Electr. Comput. Eng., vol. 2020, Apr. 2020, Art. no. e3189691, doi: 10.1155/2020/3189691.
[34] Y. Liu, P. Sun, N. Wergeles, and Y. Shang, "A survey and performance evaluation of deep learning methods for small object detection," Expert Syst. Appl., vol. 172, Jun. 2021, Art. no. 114602, doi: 10.1016/j.eswa.2021.114602.
[35] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu, "A survey of deep learning-based object detection," IEEE Access, vol. 7, pp. 128837–128868, 2019, doi: 10.1109/ACCESS.2019.2939201.
[36] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Kauai, HI, USA, Dec. 2001, pp. I-511–I-518, doi: 10.1109/CVPR.2001.990517.
[37] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), San Diego, CA, USA, Jun. 2005, pp. 886–893, doi: 10.1109/CVPR.2005.177.
[38] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Anchorage, AK, USA, Jun. 2008, pp. 1–8, doi: 10.1109/CVPR.2008.4587597.
[39] S. Ullman, M. Vidal-Naquet, and E. Sali, "Visual features of intermediate complexity and their use in classification," Nature Neurosci., vol. 5, no. 7, pp. 682–687, Jul. 2002, doi: 10.1038/nn870.
[40] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proc. Workshop Stat. Learn. Comput. Vis. ECCV, May 2004, vol. 1, nos. 1–22, pp. 1–2.
[41] F.-F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, San Diego, CA, USA, Jun. 2005, pp. 524–531, doi: 10.1109/CVPR.2005.16.
[42] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, "Discovering objects and their location in images," in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 1, Beijing, China, Oct. 2005, pp. 370–377, doi: 10.1109/ICCV.2005.77.
[43] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Computer Vision—ECCV 2002, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Berlin, Germany: Springer, 2002, pp. 113–127, doi: 10.1007/3-540-47979-1_8.
[44] H. Schneiderman and T. Kanade, "Object detection using the statistics of parts," Int. J. Comput. Vis., vol. 56, no. 3, pp. 151–177, Feb. 2004, doi: 10.1023/B:VISI.0000011202.85607.00.
[45] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders, "Segmentation as selective search for object recognition," in Proc. Int. Conf. Comput. Vis., Barcelona, Spain, Nov. 2011, pp. 1879–1886, doi: 10.1109/ICCV.2011.6126456.
[46] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in Proc. Int. Conf. Image Process., Rochester, NY, USA, 2002, pp. I-900–I-903, doi: 10.1109/ICIP.2002.1038171.
[47] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004, doi: 10.1023/B:VISI.0000029664.99615.94.
[48] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Computer Vision—ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds. Berlin, Germany: Springer, 2006, pp. 404–417, doi: 10.1007/11744023_32.
[49] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in Computer Vision—ECCV 2010, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Berlin, Germany: Springer, 2010, pp. 778–792, doi: 10.1007/978-3-642-15561-1_56.
[50] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995, doi: 10.1007/BF00994018.
[51] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997, doi: 10.1006/jcss.1997.1504.
[52] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967, doi: 10.1109/TIT.1967.1053964.
[53] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010, doi: 10.1007/s11263-009-0275-4.
[54] ImageNet Large Scale Visual Recognition Competition 2010 (ILSVRC2010). Accessed: Oct. 22, 2019. [Online]. Available: http://www.image-net.org/challenges/LSVRC/2010/
[55] MNIST Demos on Yann LeCun's Website. Accessed: Oct. 22, 2019. [Online]. Available: http://yann.lecun.com/exdb/lenet/
[56] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," 2014, arXiv:1405.0312. Accessed: Oct. 26, 2019.
[57] A. B. Amjoud and M. Amrouch, "Convolutional neural networks backbones for object detection," in Image and Signal Processing, A. El Moataz, D. Mammass, A. Mansouri, and F. Nouboud, Eds. Cham, Switzerland: Springer, 2020, pp. 282–289, doi: 10.1007/978-3-030-51935-3_30.
[58] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. Accessed: Oct. 22, 2019.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[60] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 2019, arXiv:1905.11946. Accessed: Aug. 2, 2022.
[61] X. Du, T.-Y. Lin, P. Jin, G. Ghiasi, M. Tan, Y. Cui, Q. V. Le, and X. Song, "SpineNet: Learning scale-permuted backbone for recognition and localization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11589–11598, doi: 10.1109/CVPR42600.2020.01161.
[62] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 1571–1580, doi: 10.1109/CVPRW50498.2020.00203.
[63] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," 2016, arXiv:1612.03144. Accessed: Oct. 8, 2019.
[64] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7029–7038, doi: 10.1109/CVPR.2019.00720.
[65] S. Liu, D. Huang, and Y. Wang, "Learning spatial fusion for single-shot object detection," 2019, arXiv:1911.09516.
[66] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768, doi: 10.1109/CVPR.2018.00913.
[67] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10778–10787, doi: 10.1109/CVPR42600.2020.01079.
[68] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," 2014, arXiv:1406.4729.
[69] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," in Computer Vision—ECCV 2018. Berlin, Germany: Springer-Verlag, Sep. 2018, pp. 404–419, doi: 10.1007/978-3-030-01252-6_24.
[70] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Computer Vision—ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 3–19, doi: 10.1007/978-3-030-01234-2_1.
[71] X. Li, W. Wang, X. Hu, J. Li, J. Tang, and J. Yang, "Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 11627–11636, doi: 10.1109/CVPR46437.2021.01146.
[72] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 779–788, doi: 10.1109/CVPR.2016.91.
[73] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," 2015, arXiv:1512.02325.
[74] H. Law and J. Deng, "CornerNet: Detecting objects as paired keypoints," Int. J. Comput. Vis., vol. 128, no. 3, pp. 642–656, Mar. 2020, doi: 10.1007/s11263-019-01204-1.
[75] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9626–9635, doi: 10.1109/ICCV.2019.00972.
[76] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017, doi: 10.1109/TPAMI.2016.2577031.
[77] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," 2017, arXiv:1703.06870.
[78] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, "RepPoints: Point set representation for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9656–9665, doi: 10.1109/ICCV.2019.00975.
[79] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, Feb. 2020, doi: 10.1109/TPAMI.2018.2858826.
[80] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," 2016, arXiv:1612.08242. Accessed: Oct. 8, 2019.
[81] T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun, "MetaAnchor: Learning to detect objects with customized anchors," in Proc. 32nd Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2018, pp. 318–328.
[82] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, "Region proposal by guided anchoring," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2960–2969, doi: 10.1109/CVPR.2019.00308.
[83] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, pp. 154–171, Apr. 2013, doi: 10.1007/s11263-013-0620-5.
[84] R. Girshick, "Fast R-CNN," 2015, arXiv:1504.08083. Accessed: Oct. 8, 2019.
[85] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," 2013, arXiv:1311.2524. Accessed: Oct. 8, 2019.
[86] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in Computer Vision—ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016, pp. 354–370, doi: 10.1007/978-3-319-46493-0_22.
[87] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 6154–6162, doi: 10.1109/CVPR.2018.00644.
[88] H. Lee, S. Eum, and H. Kwon, "ME R-CNN: Multi-expert R-CNN for object detection," 2017, arXiv:1704.01069.
[89] Y. Li, Y. Chen, N. Wang, and Z.-X. Zhang, "Scale-aware trident networks for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6053–6062, doi: 10.1109/ICCV.2019.00615.
[90] B. Singh and L. S. Davis, "An analysis of scale invariance in object detection—SNIP," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 3578–3587, doi: 10.1109/CVPR.2018.00377.
[91] M. Najibi, B. Singh, and L. Davis, "AutoFocus: Efficient multi-scale inference," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9744–9754, doi: 10.1109/ICCV.2019.00984.
[92] T. Kong, A. Yao, Y. Chen, and F. Sun, "HyperNet: Towards accurate region proposal generation and joint object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 845–853, doi: 10.1109/CVPR.2016.98.
[93] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang, "Bounding box regression with uncertainty for accurate object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2883–2892, doi: 10.1109/CVPR.2019.00300.
[94] M. Najibi, M. Rastegari, and L. S. Davis, "G-CNN: An iterative grid based object detector," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 2369–2377, doi: 10.1109/CVPR.2016.260.
[95] X. Wang, A. Shrivastava, and A. Gupta, "A-fast-RCNN: Hard positive generation via adversary for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3039–3048, doi: 10.1109/CVPR.2017.324.
[96] Z. Tan, X. Nie, Q. Qian, N. Li, and H. Li, "Learning to rank proposals for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8272–8280, doi: 10.1109/ICCV.2019.00836.
[97] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, "Libra R-CNN: Towards balanced learning for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 821–830, doi: 10.1109/CVPR.2019.00091.
[98] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, New York, NY, USA, Jun. 2006, pp. 2169–2178, doi: 10.1109/CVPR.2006.68.
[99] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision—ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham, Switzerland: Springer, 2014, pp. 818–833, doi: 10.1007/978-3-319-10590-1_53.
[100] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems, vol. 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2016, pp. 379–387. Accessed: Apr. 10, 2020. [Online]. Available: http://papers.nips.cc/paper/6465-r-fcn-object-detection-via-region-based-fully-convolutional-networks.pdf
[101] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 2917–2927, doi: 10.1109/CVPR46437.2021.00294.
[102] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 4203–4212, doi: 10.1109/CVPR.2018.00442.
[103] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, "DSOD: Learning deeply supervised object detectors from scratch," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 1937–1945, doi: 10.1109/ICCV.2017.212.
[104] R. Zhu, S. Zhang, X. Wang, L. Wen, H. Shi, L. Bo, and T. Mei, "ScratchDet: Training single-shot object detectors from scratch," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2263–2272, doi: 10.1109/CVPR.2019.00237.
[105] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," 2017, arXiv:1701.06659. Accessed: Oct. 8, 2019.
[106] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, "RON: Reverse connection with objectness prior networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 5244–5252, doi: 10.1109/CVPR.2017.557.
[107] P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu, "Scale-transferrable object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 528–537, doi: 10.1109/CVPR.2018.00062.
[108] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille, "Single-shot object detection with enriched semantics," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 5813–5821, doi: 10.1109/CVPR.2018.00609.
[109] T. Wang, R. M. Anwer, H. Cholakkal, F. S. Khan, Y. Pang, and L. Shao, "Learning rich features at high-speed for single-shot object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1971–1980, doi: 10.1109/ICCV.2019.00206.
[110] J. Nie, R. M. Anwer, H. Cholakkal, F. S. Khan, Y. Pang, and L. Shao, "Enriched feature guided refinement network for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, South Korea, Oct. 2019, pp. 9536–9545, doi: 10.1109/ICCV.2019.00963.
[111] S. Li, L. Yang, J. Huang, X.-S. Hua, and L. Zhang, "Dynamic anchor feature selection for single-shot object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6608–6617, doi: 10.1109/ICCV.2019.00671.
[112] K. Chen, J. Li, W. Lin, J. See, J. Wang, L. Duan, Z. Chen, C. He, and J. Zou, "Towards accurate one-stage object detection with AP-loss," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 5114–5122, doi: 10.1109/CVPR.2019.00526.
[113] S.-W. Kim, H.-K. Kook, J.-Y. Sun, M.-C. Kang, and S.-J. Ko, "Parallel feature pyramid network for object detection," in Computer Vision—ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 239–256, doi: 10.1007/978-3-030-01228-1_15.
[114] T. Kong, F. Sun, W. Huang, and H. Liu, "Deep feature pyramid reconfiguration for object detection," in Computer Vision—ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 172–188, doi: 10.1007/978-3-030-01228-1_11.
[115] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, Jun. 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[116] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[117] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn., Madison, WI, USA, Jun. 2010, pp. 807–814.
[118] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun, "MegDet: A large mini-batch object detector," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6181–6189, doi: 10.1109/CVPR.2018.00647.
[119] K. Kim and H. S. Lee, "Probabilistic anchor assignment with IoU prediction for object detection," in Computer Vision—ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham, Switzerland: Springer, 2020, pp. 355–371, doi: 10.1007/978-3-030-58595-2_22.
[120] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[121] X. Zhou, J. Zhuo, and P. Krähenbühl, "Bottom-up object detection by grouping extreme and center points," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 850–859, doi: 10.1109/CVPR.2019.00094.
[122] C. Zhu, Y. He, and M. Savvides, "Feature selective anchor-free module for single-shot object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 840–849, doi: 10.1109/CVPR.2019.00093.
[123] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9756–9765, doi: 10.1109/CVPR42600.2020.00978.
[124] Z. Ge, S. Liu, Z. Li, O. Yoshie, and J. Sun, "OTA: Optimal transport assignment for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 303–312, doi: 10.1109/CVPR46437.2021.00037.
[125] P. A. Knight, "The Sinkhorn–Knopp algorithm: Convergence and applications," SIAM J. Matrix Anal. Appl., vol. 30, no. 1, pp. 261–275, Jan. 2008, doi: 10.1137/060659624.
[126] H. Su, Y. He, R. Jiang, J. Zhang, W. Zou, and B. Fan, "DSLA: Dynamic smooth label assignment for efficient anchor-free object detection," Pattern Recognit., vol. 131, Nov. 2022, Art. no. 108868, doi: 10.1016/j.patcog.2022.108868.
[127] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [148] L. Zheng, C. Fu, and Y. Zhao, ‘‘Extend the shallow part of sin-
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, gle shot multibox detector via convolutional neural network,’’ 2018,
J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16 × 16 words: arXiv:1801.05918.
Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929. [149] Y. Chen, J. Li, B. Zhou, J. Feng, and S. Yan, ‘‘Weaving multi-scale context
[128] S. Rao, Y. Li, R. Ramakrishnan, A. Hassaine, D. Canoy, J. Cleland, for single shot detector,’’ 2017, arXiv:1712.03149.
T. Lukasiewicz, G. Salimi-Khorshidi, and K. Rahimi, ‘‘An explainable [150] X. Wu, D. Zhang, J. Zhu, and S. C. H. Hoi, ‘‘Single-shot bidi-
transformer-based deep learning model for the prediction of incident heart rectional pyramid networks for high-quality object detection,’’ 2018,
failure,’’ IEEE J. Biomed. Health Informat., vol. 26, no. 7, pp. 3362–3372, arXiv:1803.08208. Accessed: Oct. 8, 2019.
Jul. 2022, doi: 10.1109/JBHI.2022.3148820. [151] Y. Pang, T. Wang, R. M. Anwer, F. S. Khan, and L. Shao, ‘‘Efficient
[129] A. B. Amjoud and M. Amrouch, ‘‘Automatic generation of chest X-ray featurized image pyramid network for single shot detector,’’ in Proc.
reports using a transformer-based deep learning model,’’ in Proc. 5th IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
Int. Conf. Intell. Comput. Data Sci. (ICDS), Oct. 2021, pp. 1–5, doi: pp. 7328–7336, doi: 10.1109/CVPR.2019.00751.
10.1109/ICDS53782.2021.9626725.
[152] K. Song, H. Yang, and Z. Yin, ‘‘Multi-scale attention deep neu-
[130] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
ral network for fast accurate object detection,’’ IEEE Trans. Circuits
S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ 2020,
Syst. Video Technol., vol. 29, no. 10, pp. 2972–2985, Oct. 2019, doi:
arXiv:2005.12872.
10.1109/TCSVT.2018.2875449.
[131] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
[153] J. Cao, Y. Pang, J. Han, and X. Li, ‘‘Hierarchical shot detector,’’ in Proc.
L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ 2017,
IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9704–9713,
arXiv:1706.03762. Accessed: Mar. 4, 2021.
doi: 10.1109/ICCV.2019.00980.
[132] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, ‘‘Fast convergence
of DETR with spatially modulated co-attention,’’ in Proc. IEEE/CVF [154] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, and H. Lu, ‘‘CoupleNet: Cou-
Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3601–3610, doi: pling global structure with local parts for object detection,’’ in Proc. IEEE
10.1109/ICCV48922.2021.00360. Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 4146–4154,
[133] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, doi: 10.1109/ICCV.2017.444.
‘‘Swin transformer: Hierarchical vision transformer using shifted Win- [155] J. Cao, Y. Pang, and X. Li, ‘‘Triply supervised decoder networks for
dows,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, joint detection and segmentation,’’ in Proc. IEEE/CVF Conf. Com-
pp. 9992–10002, doi: 10.1109/ICCV48922.2021.00986. put. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7384–7393, doi:
[134] Y. Wang, X. Zhang, T. Yang, and J. Sun, ‘‘Anchor DETR: Query design 10.1109/CVPR.2019.00757.
for transformer-based object detection,’’ 2021, arXiv:2109.07107. [156] Y. Zhu, C. Zhao, H. Guo, J. Wang, X. Zhao, and H. Lu, ‘‘Attention
[135] L. He and S. Todorovic, ‘‘DESTR: Object detection with CoupleNet: Fully convolutional attention coupling network for object
split transformer,’’ in Proc. IEEE/CVF Conf. Comput. Vis. detection,’’ IEEE Trans. Image Process., vol. 28, no. 1, pp. 113–126,
Pattern Recognit. (CVPR), Jun. 2022, pp. 9367–9376, doi: Jan. 2019, doi: 10.1109/TIP.2018.2865280.
10.1109/CVPR52688.2022.00916. [157] B. Cheng, Y. Wei, H. Shi, R. Feris, J. Xiong, and T. Huang, ‘‘Revisiting
[136] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Region-based convo- RCNN: On awakening the classification power of faster RCNN,’’ in Com-
lutional networks for accurate object detection and segmentation,’’ IEEE puter Vision—ECCV 2018. Berlin, Germany: Springer-Verlag, Sep. 2018,
Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, Jan. 2016, pp. 473–490, doi: 10.1007/978-3-030-01267-0_28.
doi: 10.1109/TPAMI.2015.2437384. [158] B. Singh, M. Najibi, and L. S. Davis, ‘‘SNIPER: Efficient multi-scale
[137] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, ‘‘Object detec- training,’’ in Proc. 32nd Int. Conf. Neural Inf. Process. Syst. Red Hook,
tion networks on convolutional feature maps,’’ 2015, arXiv:1504.06066. NY, USA: Curran Associates, 2018, pp. 9333–9343.
Accessed: Oct. 8, 2019. [159] W. Xiang, D.-Q. Zhang, H. Yu, and V. Athitsos, ‘‘Context-aware single-
[138] A. Shrivastava, A. Gupta, and R. Girshick, ‘‘Training region-based object shot detector,’’ in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV),
detectors with online hard example mining,’’ in Proc. IEEE Conf. Com- Mar. 2018, pp. 1784–1793, doi: 10.1109/WACV.2018.00198.
put. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, [160] H. Wang, Q. Wang, M. Gao, P. Li, and W. Zuo, ‘‘Multi-scale location-
pp. 761–769, doi: 10.1109/CVPR.2016.89. aware kernel representation for object detection,’’ in Proc. IEEE/CVF
[139] Y. Liu, R. Wang, S. Shan, and X. Chen, ‘‘Structure inference net: Object Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1248–1257, doi:
detection using scene-level context and instance-level relationships,’’ 10.1109/CVPR.2018.00136.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, [161] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, ‘‘YOLACT: Real-time instance
pp. 6985–6994, doi: 10.1109/CVPR.2018.00730. segmentation,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
[140] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, ‘‘Inside-outside Oct. 2019, pp. 9156–9165, doi: 10.1109/ICCV.2019.00925.
net: Detecting objects in context with skip pooling and recurrent
[162] Y. Xin, S. Wang, L. Li, W. Zhang, and Q. Huang, ‘‘Reverse densely
neural networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
connected feature pyramid network for object detection,’’ in Com-
nit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 2874–2883, doi:
puter Vision—ACCV 2018, C. V. Jawahar, H. Li, G. Mori, and
10.1109/CVPR.2016.314.
K. Schindler, Eds. Cham, Switzerland: Springer, 2019, pp. 530–545, doi:
[141] S. Gidaris and N. Komodakis, ‘‘LocNet: Improving localization accu-
10.1007/978-3-030-20873-8_34.
racy for object detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 789–798, doi: [163] X. Chen and A. Gupta, ‘‘Spatial memory for context reasoning in object
10.1109/CVPR.2016.92. detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017,
[142] S. Gidaris and N. Komodakis, ‘‘Object detection via a multi-region & pp. 4106–4116, doi: 10.1109/ICCV.2017.440.
semantic segmentation-aware CNN model,’’ 2015, arXiv:1505.01749. [164] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chin-
Accessed: Oct. 8, 2019. tala, and P. Dollár, ‘‘A multipath network for object detection,’’ 2016,
[143] J. Jeong, H. Park, and N. Kwak, ‘‘Enhancement of SSD by concatenating arXiv:1604.02135.
feature maps for object detection,’’ 2017, arXiv:1705.09587. Accessed: [165] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling,
Oct. 8, 2019. ‘‘M2Det: A single-shot object detector based on multi-level fea-
[144] S. Woo, S. Hwang, and I. S. Kweon, ‘‘StairNet: Top-down semantic ture pyramid network,’’ in Proc. 33rd AAAI Conf. Artif. Intell. 31st
aggregation for accurate one shot detection,’’ in Proc. IEEE Winter Innov. Appl. Artif. Intell. Conf., 9th AAAI Symp. Educ. Adv. Artif.
Conf. Appl. Comput. Vis. (WACV), Lake Tahoe, NV, USA, Mar. 2018, Intell. Honolulu, HI, USA: AAAI Press, 2019, pp. 9259–9266, doi:
pp. 1093–1102, doi: 10.1109/WACV.2018.00125. 10.1609/aaai.v33i01.33019259.
[145] Z. Li and F. Zhou, ‘‘FSSD: Feature fusion single shot multibox detector,’’ [166] L. Tychsen-Smith and L. Petersson, ‘‘DeNet: Scalable real-time object
2017, arXiv:1712.00960. Accessed: Oct. 8, 2019. detection with directed sparse sampling,’’ in Proc. IEEE Int. Conf.
[146] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, ‘‘BlitzNet: A real- Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 428–436, doi:
time deep network for scene understanding,’’ 2017, arXiv:1708.02813. 10.1109/ICCV.2017.54.
Accessed: Oct. 8, 2019. [167] S. Wang, Y. Gong, J. Xing, L. Huang, C. Huang, and W. Hu, ‘‘RDSNet:
[147] K. Lee, J. Choi, J. Jeong, and N. Kwak, ‘‘Residual features and unified A new deep architecture for reciprocal object detection and instance
prediction network for single stage detection,’’ 2017, arXiv:1707.05031. segmentation,’’ 2019, arXiv:1912.05070.
[168] X. Chen, R. Girshick, K. He, and P. Dollar, ‘‘TensorMask: A foundation [189] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, ‘‘Gen-
for dense object segmentation,’’ in Proc. IEEE/CVF Int. Conf. Comput. eralized focal loss: Learning qualified and distributed bounding boxes for
Vis. (ICCV), Oct. 2019, pp. 2061–2069, doi: 10.1109/ICCV.2019.00215. dense object detection,’’ in Proc. 34th Int. Conf. Neural Inf. Process. Syst.
[169] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, ‘‘Beyond Red Hook, NY, USA: Curran Associates, 2020, pp. 21002–21012.
skip connections: Top-down modulation for object detection,’’ 2016, [190] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye, ‘‘FreeAnchor: Learning to
arXiv:1612.06851. Accessed: Oct. 8, 2019. match anchors for visual object detection,’’ in Proc. Adv. Neural Inf.
[170] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, Process. Syst. Red Hook, NY, USA: Curran Associates, 2019, pp. 1–9.
‘‘Deformable convolutional networks,’’ 2017, arXiv:1703.06211. Accessed: Oct. 21, 2019.
Accessed: Oct. 8, 2019. [191] G. Song, Y. Liu, and X. Wang, ‘‘Revisiting the sibling head
[171] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, ‘‘Relation networks for object in object detector,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Pattern Recognit. (CVPR), Jun. 2020, pp. 11560–11569, doi:
Jun. 2018, pp. 3588–3597, doi: 10.1109/CVPR.2018.00378. 10.1109/CVPR42600.2020.01158.
[172] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa, ‘‘Deep [192] H. Law, Y. Teng, O. Russakovsky, and J. Deng, ‘‘CornerNet-lite: Efficient
regionlets for object detection,’’ in Computer Vision—ECCV 2018, keypoint based object detection,’’ 2019, arXiv:1904.08900.
V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, [193] J. Wang, W. Zhang, Y. Cao, K. Chen, J. Pang, T. Gong, J. Shi, C. C. Loy,
Switzerland: Springer, 2018, pp. 827–844, doi: 10.1007/978-3-030- and D. Lin, ‘‘Side-aware boundary localization for more precise object
01252-6_49. detection,’’ in Computer Vision—ECCV 2020, A. Vedaldi, H. Bischof,
[173] J. Gu, H. Hu, L. Wang, Y. Wei, and J. Dai, ‘‘Learning region fea- T. Brox, and J.-M. Frahm, Eds. Cham, Switzerland: Springer, 2020,
tures for object detection,’’ in Computer Vision—ECCV 2018, V. Ferrari, pp. 403–419, doi: 10.1007/978-3-030-58548-8_24.
M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: [194] W. Ke, T. Zhang, Z. Huang, Q. Ye, J. Liu, and D. Huang, ‘‘Multiple
Springer, 2018, pp. 392–406, doi: 10.1007/978-3-030-01258-8_24. anchor learning for visual object detection,’’ in Proc. IEEE/CVF Conf.
[174] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi, ‘‘Consistent optimization for Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10203–10212,
single-shot object detection,’’ 2019, arXiv:1901.06563. doi: 10.1109/CVPR42600.2020.01022.
[175] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, ‘‘DetNet: [195] H. Li, Z. Wu, C. Zhu, C. Xiong, R. Socher, and L. S. Davis, ‘‘Learn-
Design backbone for object detection,’’ in Computer Vision—ECCV ing from noisy anchors for one-stage object detection,’’ in Proc.
2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
Cham, Switzerland: Springer, 2018, pp. 339–354, doi: 10.1007/978-3- pp. 10585–10594, doi: 10.1109/CVPR42600.2020.01060.
030-01240-3_21. [196] Z. Sun, S. Cao, Y. Yang, and K. Kitani, ‘‘Rethinking transformer-
[176] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, ‘‘Acquisition of local- based set prediction for object detection,’’ in Proc. IEEE/CVF
ization confidence for accurate object detection,’’ in Computer Vision— Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3591–3600, doi:
ECCV 2018. Berlin, Germany: Springer-Verlag, Sep. 2018, pp. 816–832, 10.1109/ICCV48922.2021.00359.
doi: 10.1007/978-3-030-01264-9_48. [197] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, ‘‘CenterNet: Key-
[177] Y. Lee and J. Park, ‘‘CenterMask: Real-time anchor-free instance segmen- point triplets for object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput.
tation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Vis. (ICCV), Oct. 2019, pp. 6568–6577, doi: 10.1109/ICCV.2019.00667.
Jun. 2020, pp. 13903–13912, doi: 10.1109/CVPR42600.2020.01392. [198] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren,
[178] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, ‘‘Soft-NMS— S. Han, E. Ding, and S. Wen, ‘‘PP-YOLO: An effective and efficient
Improving object detection with one line of code,’’ in Proc. IEEE implementation of object detector,’’ 2020, arXiv:2007.12099.
Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5562–5570, doi: [199] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, ‘‘Scaled-YOLOv4:
10.1109/ICCV.2017.593. Scaling cross stage partial network,’’ in Proc. IEEE/CVF Conf. Com-
[179] H. Zhang, H. Chang, B. Ma, S. Shan, and X. Chen, ‘‘Cascade Reti- put. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13024–13033, doi:
naNet: Maintaining consistency for single-stage object detection,’’ 2019, 10.1109/CVPR46437.2021.01283.
arXiv:1907.06881. [200] L. Yao, H. Xu, W. Zhang, X. Liang, and Z. Li, ‘‘SM-NAS: Structural-
[180] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, ‘‘YOLOv4: Optimal to-modular neural architecture search for object detection,’’ 2019,
speed and accuracy of object detection,’’ 2020, arXiv:2004.10934. arXiv:1911.09929.
[181] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, ‘‘SOD-MTGAN: Small [201] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and
object detection via multi-task generative adversarial network,’’ in Com- J. Wang, ‘‘Conditional DETR for fast training convergence,’’ 2021,
puter Vision—ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and arXiv:2108.06152.
Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, pp. 210–226, doi: [202] C. Zhu, F. Chen, Z. Shen, and M. Savvides, ‘‘Soft anchor-point object
10.1007/978-3-030-01261-8_13. detection,’’ in Computer Vision—ECCV 2020, A. Vedaldi, H. Bischof,
[182] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, ‘‘Light- T. Brox, and J.-M. Frahm, Eds. Cham, Switzerland: Springer, 2020,
head R-CNN: In defense of two-stage object detector,’’ 2017, pp. 91–107, doi: 10.1007/978-3-030-58545-7_6.
arXiv:1711.07264. [203] Z. Dong, G. Li, Y. Liao, F. Wang, P. Ren, and C. Qian, ‘‘Centripetal-
[183] B. Li, Y. Liu, and X. Wang, ‘‘Gradient harmonized single-stage detector,’’ Net: Pursuing high-quality keypoint pairs for object detection,’’ in Proc.
in Proc. 33rd AAAI Conf. Artif. Intell. 31st Innov. Appl. Artif. Intell. Conf. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
9th AAAI Symp. Educ. Adv. Artif. Intell. Honolulu, HI, USA: AAAI Press, pp. 10516–10525, doi: 10.1109/CVPR42600.2020.01053.
2019, pp. 8577–8584, doi: 10.1609/aaai.v33i01.33018577. [204] B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, and J. Sun, ‘‘AutoAs-
[184] J. Peng, M. Sun, Z.-X. Zhang, T. Tan, and J. Yan, ‘‘Efficient neural sign: Differentiable label assignment for dense object detection,’’ 2020,
architecture transformation search in channel-level for object detection,’’ arXiv:2007.03496.
in Proc. 33rd Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: [205] H. Zhang, H. Chang, B. Ma, N. Wang, and X. Chen, ‘‘Dynamic R-CNN:
Curran Associates, 2019, pp. 14335–14344. Towards high quality object detection via dynamic training,’’ in Computer
[185] L. Tychsen-Smith and L. Petersson, ‘‘Improving object localization Vision—ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm,
with fitness NMS and bounded IoU loss,’’ in Proc. IEEE/CVF Conf. Eds. Cham, Switzerland: Springer, 2020, pp. 260–275, doi: 10.1007/978-
Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, 3-030-58555-6_16.
pp. 6877–6885, doi: 10.1109/CVPR.2018.00719. [206] K. Chen, W. Ouyang, C. C. Loy, D. Lin, J. Pang, J. Wang, Y. Xiong, X. Li,
[186] Z. Chen, S. Huang, and D. Tao, ‘‘Context refinement for object S. Sun, W. Feng, Z. Liu, and J. Shi, ‘‘Hybrid task cascade for instance
detection,’’ in Computer Vision—ECCV 2018, V. Ferrari, M. Hebert, segmentation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
C. Sminchisescu, and Y. Weiss, Eds. Cham, Switzerland: Springer, 2018, (CVPR), Jun. 2019, pp. 4969–4978, doi: 10.1109/CVPR.2019.00511.
pp. 74–89, doi: 10.1007/978-3-030-01237-3_5. [207] H. Qiu, Y. Ma, Z. Li, S. Liu, and J. Sun, ‘‘BorderDet: Border feature
[187] T. Kong, F. Sun, H. Liu, Y. Jiang, L. Li, and J. Shi, ‘‘FoveaBox: Beyound for dense object detection,’’ in Computer Vision—ECCV 2020. Berlin,
anchor-based object detection,’’ IEEE Trans. Image Process., vol. 29, Germany: Springer-Verlag, Aug. 2020, pp. 549–564, doi: 10.1007/978-
pp. 7389–7398, 2020, doi: 10.1109/TIP.2020.3002345. 3-030-58452-8_32.
[188] C.-Y. Fu, M. Shvets, and A. C. Berg, ‘‘RetinaMask: Learning to predict [208] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable
masks improves state-of-the-art single-shot detection for free,’’ 2019, DETR: Deformable transformers for end-to-end object detection,’’ 2020,
arXiv:1901.03353. arXiv:2010.04159.
[209] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, ‘‘You only learn one represen- [226] T. Liang, X. Chu, Y. Liu, Y. Wang, Z. Tang, W. Chu, J. Chen, and
tation: Unified network for multiple tasks,’’ 2021, arXiv:2105.04206. H. Ling, ‘‘CBNet: A composite backbone network architecture for object
[210] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, detection,’’ IEEE Trans. Image Process., vol. 31, pp. 6893–6906, 2022,
‘‘Dynamic head: Unifying object detection heads with attentions,’’ doi: 10.1109/TIP.2022.3216771.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), [227] X. Wang, S. Zhang, Z. Yu, L. Feng, and W. Zhang, ‘‘Scale-equalizing
Jun. 2021, pp. 7369–7378, doi: 10.1109/CVPR46437.2021.00729. pyramid convolution for object detection,’’ in Proc. IEEE/CVF Conf.
[211] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13356–13365,
‘‘End-to-end semi-supervised object detection with soft teacher,’’ 2021, doi: 10.1109/CVPR42600.2020.01337.
arXiv:2106.09018. [228] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, ‘‘Inception-v4,
[212] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, inception-ResNet and the impact of residual connections on learning,’’ in
Z. Zhang, L. Dong, F. Wei, and B. Guo, ‘‘Swin transformer v2: Proc. 31st AAAI Conf. Artif. Intell., San Francisco, CA, USA, Feb. 2017,
Scaling up capacity and resolution,’’ in Proc. IEEE/CVF Conf. Com- pp. 4278–4284.
put. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11999–12009, doi: [229] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely
10.1109/CVPR52688.2022.01170. connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis.
[213] X. Chen, J. Yu, S. Kong, Z. Wu, and L. Wen, ‘‘Joint anchor-feature Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 2261–2269,
refinement for real-time accurate object detection in images and videos,’’ doi: 10.1109/CVPR.2017.243.
2018, arXiv:1807.08638. [230] J. Hu, L. Shen, and G. Sun, ‘‘Squeeze-and-excitation networks,’’ in
[214] Z. Wang, J. Guo, C. Zhang, and B. Wang, ‘‘Multiscale feature Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
enhancement network for salient object detection in optical remote pp. 7132–7141, doi: 10.1109/CVPR.2018.00745.
sensing images,’’ IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, [231] A. Newell, K. Yang, and J. Deng, ‘‘Stacked hourglass networks for human
Art. no. 5634819, doi: 10.1109/TGRS.2022.3224815. pose estimation,’’ in Computer Vision—ECCV 2016, vol. 9912, B. Leibe,
[215] A. Shafique, G. Cao, Z. Khan, M. Asad, and M. Aslam, ‘‘Deep learning- J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer,
based change detection in remote sensing images: A review,’’ Remote 2016, pp. 483–499, doi: 10.1007/978-3-319-46484-8_29.
Sens., vol. 14, no. 4, p. 871, Feb. 2022. [232] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie,
[216] S. Shokirov, T. Jucker, S. R. Levick, A. D. Manning, T. Bonnet, M. Yebra, ‘‘A ConvNet for the 2020s,’’ in Proc. IEEE/CVF Conf. Comput.
and K. N. Youngentob, ‘‘Habitat highs and lows: Using terrestrial and Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11966–11976, doi:
UAV LiDAR for modelling avian species richness and abundance in 10.1109/CVPR52688.2022.01167.
a restored woodland,’’ Remote Sens. Environ., vol. 285, Feb. 2023, [233] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, ‘‘YOLOv7: Trainable
Art. no. 113326. bag-of-freebies sets new state-of-the-art for real-time object detectors,’’
[217] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, ‘‘A deep learn- 2022, arXiv:2207.02696.
ing approach to network intrusion detection,’’ IEEE Trans. Emerg.
Topics Comput. Intell., vol. 2, no. 1, pp. 41–50, Feb. 2018, doi:
10.1109/TETCI.2017.2772792.
[218] S. Majchrowska, A. Mikołajczyk, M. Ferlin, Z. Klawikowska,
M. A. Plantykow, A. Kwasigroch, and K. Majek, ‘‘Deep learning-based
waste detection in natural and urban environments,’’ Waste Manage.,
vol. 138, pp. 274–284, Feb. 2022, doi: 10.1016/j.wasman.2021.12.001.
AYOUB BENALI AMJOUD received the engineering degree in information systems and computer networks from the Faculty of Science and Technology of Marrakech, Cadi Ayyad University, in 2017. He is currently pursuing the Ph.D. degree with the Image et Reconnaissance de Forme-Systèmes Intelligents et Communicants (IRF-SIC) Laboratory, Faculty of Sciences of Agadir, Ibn Zohr University. He has worked with national and international organizations, where he held research and development positions. His research interests include deep learning in computer vision, pattern recognition, and image captioning. He received the Grant of Excellence from the National Centre for Scientific and Technical Research of Morocco, which is awarded to the most outstanding student researchers in Morocco.
MUSTAPHA AMROUCH received the master's degree in mathematics and applied computer science and the Ph.D. degree in computer vision from the Faculty of Sciences, Ibn Zohr University, Agadir, Morocco, in 2007 and 2012, respectively. He is currently a Professor and a Researcher with the Image et Reconnaissance de Forme-Systèmes Intelligents et Communicants (IRF-SIC) Laboratory, Ibn Zohr University. His research interests include understanding the principles behind machine learning and computer vision, and improving and applying their algorithms to build real applications.