Anchor-Based Vs Anchor-Free Object
Abstract - As one of the core tasks of computer vision, object detection has been widely applied with the rapid development of computer technology, especially in face detection, behavior detection, autonomous driving, and intelligent monitoring. To address the shortcomings of traditional object detection, such as low detection accuracy, low efficiency, and poor robustness, this article summarizes deep learning-based detectors of two kinds of modules: Anchor-Based and Anchor-Free. The performance of each detector is also compared and analyzed. In addition, we summarize the development of the key technologies of object detection in terms of backbone improvement, NMS optimization, and solutions to the imbalance of positive and negative samples. Finally, the development trend of object detection is discussed with respect to lightweight detection models, weakly supervised detection, and small object detection.

Index Terms - Computer vision, Object detection, Deep learning, Anchor-Free module, Anchor-Based module

I. INTRODUCTION
Object detection must complete two tasks: recognizing an object's category and locating it precisely. It is mainly used to detect semantic objects of a certain kind in digital images and videos, such as cars or pedestrians in transportation, animals in the breeding industry, and cells in biological images. Real-time performance and accuracy are two essential properties that a detector should possess simultaneously. Object detection has been widely used in face detection, automatic driving, video surveillance, machine vision, and other fields. However, in practical applications, object detection faces numerous challenges, especially the complexity of the scene, the variability of object morphology and position, and the variety of categories. After decades of painstaking research by many scholars, object detection has developed from traditional detection technology to detection integrated with deep learning.
Traditional object detectors: (1) The Viola-Jones face detector uses Haar-like features for feature extraction and an AdaBoost-based cascade classifier for classification [1]. (2) A pedestrian detector extracts features with HOG and classifies them with an SVM [2][3]. (3) The DPM algorithm combines an improved HOG feature with an SVM classifier and adopts a multi-component strategy to handle object deformation, which won the VOC challenges from 2007 to 2009 [4]. However, traditional object detectors have the following disadvantages: selecting region proposals with sliding windows is not targeted and generates a large number of redundant boxes; the time complexity is high and the efficiency is poor; and hand-crafted features have poor robustness and accuracy.
With the rapid development of machine vision, object detection has transitioned from traditional algorithms to the stage of convolutional neural networks (CNN). Object detection based on deep learning breaks through the bottleneck of traditional detection and significantly improves accuracy and efficiency. According to the training method, the Anchor-Based module can be divided into two kinds: two-stage detectors based on region proposals and single-stage detectors based on regression. The two-stage detection method has two steps: first, region proposals are extracted by a specific algorithm, and then a CNN is used to classify them and refine the location of the bounding box. Representative detectors include R-CNN, SPP-Net, Fast R-CNN, and Faster R-CNN. Single-stage detectors do not need to generate region proposals and instead treat localization directly as regression of the bounding box; typical methods include the YOLO and SSD series.
However, the Anchor-Based module has certain defects. The scale and aspect ratio of the anchors are pre-defined, and different datasets require different settings. The anchors cannot be adjusted automatically, which makes it difficult to match objects of extreme scale. Too many preset anchors also make training prone to an imbalance between positive and negative samples. Recently, the Anchor-Free module has been gradually emerging.

II. ANCHOR-BASED
Anchor-Based detection includes two modules: single-stage detectors and two-stage detectors. Fig. 1 shows the branch diagram of Anchor-Based detectors and their characteristics.
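The pre-defined scales and aspect ratios of anchors discussed above can be made concrete with a minimal sketch. The function names `generate_anchors` and `shift_anchors`, and the default values (base size 16, scales 8/16/32, ratios 0.5/1/2, a 38 x 50 feature map with stride 16), are illustrative assumptions rather than settings taken from any detector described in this paper.

```python
import math


def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on the origin.

    ratio is height / width; every anchor of a given scale keeps the
    same area, (base_size * scale) ** 2. All defaults are assumptions.
    """
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


def shift_anchors(anchors, feat_h, feat_w, stride):
    """Tile the base anchors over every cell of a feat_h x feat_w feature map."""
    shifted = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for x1, y1, x2, y2 in anchors:
                shifted.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return shifted


if __name__ == "__main__":
    base = generate_anchors()
    all_anchors = shift_anchors(base, feat_h=38, feat_w=50, stride=16)
    print(len(base), len(all_anchors))  # prints "9 17100"
```

Even this small configuration produces more than seventeen thousand candidate boxes for one image, most of which cover background. This illustrates why preset anchors tend to cause the imbalance between positive and negative samples and why the scales and ratios have to be re-tuned for each dataset.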
Authorized licensed use limited to: Kyungpook National Univ. Downloaded on March 24,2021 at 08:41:29 UTC from IEEE Xplore. Restrictions apply.
Fig. 1 Branch diagram of Anchor-Based detectors

A. Two-Stage Detectors Based on Region Proposal
1) R-CNN: R-CNN was proposed by Ross Girshick in 2013 and was the first detector to detect objects with deep learning [5]. It extracts about 2,000 proposals from the given image with the selective search algorithm and resizes them to a fixed size. It then extracts features with a convolutional network, classifies them with SVM classifiers, and passes the results to bounding box regression. Its accuracy and real-time performance are greatly improved compared with traditional algorithms, and it has been applied to digital detection and vehicle area detection. However, it also has many disadvantages, such as a complicated training and testing structure. Extracting features from each region proposal separately causes redundant computation, so the detection speed is too slow. In addition, forcibly scaling the candidate regions may distort the object and degrade detection performance.
2) SPP-Net: Kaiming He et al. proposed SPP-Net in 2014, which adds spatial pyramid pooling between the last convolutional layer and the first fully connected layer of R-CNN [6]. This detector allows the network to accept input images of any size, which avoids the performance loss caused by cropping or warping the image in R-CNN, while still generating fixed-dimension outputs that meet the requirements of the fully connected layer. SPP-Net is much faster than R-CNN: it runs the convolutional network on a given image only once, avoiding the costly separate computation for every region proposal. However, SPP-Net is still a multi-stage pipeline, its steps remain tedious, and the problem of storage space consumption has not been solved.
3) Fast R-CNN: To address the disadvantages of R-CNN, Girshick proposed Fast R-CNN in 2015, which integrates the idea of spatial pyramid pooling (SPP) [7]. This detector improves on the first two detectors by replacing the spatial pyramid pooling layer with the ROI (Region of Interest) pooling layer. ROI pooling is a single-scale pooling, which can be regarded as a simplification of SPP. In addition, Fast R-CNN uses softmax instead of SVM for classification. It also introduces a multi-task training mode, so that the classification and regression tasks share convolutional features and are performed simultaneously. Detection precision and speed are greatly improved. However, Fast R-CNN still uses selective search to extract the region proposals, which consumes a great deal of time and cannot meet real-time demands.
4) Faster R-CNN: This algorithm was proposed by Shaoqing Ren et al. in 2015. Instead of selective search, it extracts proposals with a Region Proposal Network (RPN) [8]. After setting anchors of multiple scales on a given image, it uses softmax to determine whether each anchor is foreground or background. Then it performs a first bounding box regression to obtain accurate proposals. After that, it extracts the features of each region proposal with the ROI pooling layer to generate a fixed-dimension output, and completes classification and a second regression with fully connected layers, as in Fast R-CNN. It completes feature extraction, region proposal generation, classification, and regression in a single network, which truly implements end-to-end training. It improves detection accuracy and speed, and has been applied to pedestrian detection, face detection, traffic sign detection, remote sensing detection, etc. However, the amount of computation for obtaining region proposals and classifying each region is still large, and it cannot detect objects in real time.
5) R-FCN: Facing the contradiction between the translation invariance of image classification and the translation variance of object detection, Jifeng Dai et al. proposed R-FCN (Region-based Fully Convolutional Network) [9]. R-FCN follows the framework of Faster R-CNN, but removes the fully connected network after the ROI pooling layer and adopts a position-sensitive score map to compute the classification score of every cell of an ROI. In parallel, a regression score map branch is used for fine-tuning the coordinates of the ROI. The location and classification are then obtained through ROI pooling. R-FCN uses a fully convolutional neural network to share features and avoids the redundant per-proposal computation of fully connected layers. Both detection speed and accuracy are greatly increased compared with Faster R-CNN. It has a wide range of applications, such as pedestrian detection, face detection, remote sensing detection, gesture recognition, and financial scene recognition.
6) Mask R-CNN: It was proposed by He et al. in 2017 and performs instance segmentation at the same time as object detection [10]. The detector adds a branch for predicting a binary mask to Faster R-CNN, in parallel with the original classification and regression branches. To avoid the misalignment with the original image that is caused by the two coordinate quantization operations on the feature map in the ROI Pooling layer, Mask R-CNN replaces the ROI Pooling layer with ROI Align. It
eliminates all quantization operations and uses the four pixel values around each virtual sampling point to estimate the pixel values at non-integer positions, which improves the mask accuracy by 10%-50%. In addition, it can realize fine-grained segmentation with greater flexibility, and it has been used for ground identification detection, ship object detection, and other fields.

B. Single-stage Detectors Based on Regression
1) YOLO v1: Unlike the two-stage object detectors, YOLO (You Only Look Once) omits the stage of region proposal extraction [11]. Instead, it uses a single convolutional network without branches to complete feature extraction, regression, and classification, thereby simplifying the detector structure. The algorithm performs detection based on the global information of the image: it resizes the given image to a fixed size and then divides it into 7×7 grid cells. Each cell is responsible for detecting the objects whose center points fall in it and predicts two bounding boxes; redundant bounding boxes are then removed by NMS. The structure of YOLO is simple, which greatly improves the detection speed. However, because detection works on the entire image, it has difficulty distinguishing dense objects and is not suitable for small object detection.
2) YOLO v2: In 2017, Joseph Redmon et al. proposed YOLO v2, which uses Darknet-19 as the feature extraction network and adds a batch normalization operation after each convolution layer [12]. The classification network is fine-tuned at a resolution of 448 × 448, which improves the detection accuracy. In addition, YOLO v2 introduces anchors, removing the fully connected layer and the last pooling layer of the original network. It uses anchor boxes in the convolutional layer to predict the bounding boxes and K-means clustering to get better priors. Even though the mAP (mean average precision) decreases slightly, the recall increases greatly. YOLO v2 is more suitable for the detection of large-scale objects. The authors incorporate fine-grained features and combine high-level and low-level feature maps to improve the detection of small objects. It plays a significant role in traffic sign and moving vehicle detection.
3) YOLO v3: YOLO v3 improves on v1 and v2, further raising the accuracy while maintaining high real-time performance, especially the ability to recognize small objects [13]. For feature extraction, YOLO v3 uses the Darknet-53 network, which borrows the practice of ResNet; it continues to use K-means clustering to obtain the sizes of the bounding boxes. Instead of softmax, it uses logistic classifiers (multi-label classification). Moreover, it adds multi-scale prediction, performing independent detection on fused feature maps of multiple scales. Compared with the previous single-scale detectors, it detects small objects better. YOLO v3 can be applied to various scenarios such as infrared object detection, remote sensing image detection, and ship tracking and recognition.
4) YOLO v4: It was proposed by Alexey Bochkovskiy et al. in 2020. Its backbone is CSPDarknet, a combination of CSPNet and Darknet, which enhances the learning ability of the CNN and reduces the memory cost. It also adds SPP blocks to enlarge the receptive field. In addition, the authors proposed the Bag of freebies to train a more accurate model without increasing the inference cost, including data augmentation through CutMix, Mosaic, and SAT (Self-Adversarial Training); dropout regularization to prevent overfitting; class label smoothing to improve generalization capability; CmBN (Cross mini-Batch Normalization) for batch normalization across multiple mini-batches in a batch; and the CIoU loss for bounding box regression. The Bag of specials significantly improves the accuracy of the model at a small additional inference cost, including a modified SAM (Spatial Attention Module), which changes spatial-wise attention into point-wise attention, and the Mish activation function, which is smoother than other activation functions. In general, YOLO v4 modified several detection techniques and used a number of detection tricks to improve the detector's effect. Compared with YOLO v3, YOLO v4 improved its mAP by 10% and its FPS by 12% [14].
5) SSD: SSD (Single Shot Multibox Detector) was proposed by Wei Liu et al. It incorporates the anchor mechanism into YOLO v1, improving the detection accuracy while keeping the high speed of YOLO [15]. SSD uses VGG16 as the backbone and extracts features hierarchically to detect objects at multiple scales. It detects small-scale objects in the low-level feature maps output by the earlier convolution layers and large-scale objects in the high-level feature maps, and then performs bounding box regression and classification. However, the basic size and shape of the prior boxes on each feature map layer need to be set manually, which makes tuning dependent on experience. In addition, because the low-level feature maps pass through few convolutional layers and lack high-level semantic features, feature extraction is insufficient, so the recall is still only average when detecting small objects. Nevertheless, its multi-scale detection has played an important role in various tasks such as pedestrian detection, face detection, and remote sensing detection.
6) DSSD: To address the poor robustness of SSD when detecting small objects, Cheng-Yang Fu et al. proposed DSSD (Deconvolutional Single Shot Multibox Detector) on the basis of SSD in 2017 [16]. This detector replaces VGG with ResNet-101 as the backbone and changes the original structure to ResNet to extract deeper features for classification and regression, which deepens the network and improves its representation capabilities. The convolutional layer combines high-level, semantically strong feature maps with low-level feature maps, making full use of contextual information. It improves the representation capabilities of the low layers and the detection of small objects, but reduces the detection speed because of the deeper backbone.
7) RetinaNet: Although single-stage detectors are faster, they are inferior to two-stage detectors in terms of accuracy. The authors believe that the reason is the foreground-background class imbalance. Because the single-stage detectors are based on the global information of
the image, the number of easy negative samples is much larger than the number of hard positive samples, which reduces the overall detection accuracy. To solve this problem, the authors modified the cross-entropy loss and proposed Focal Loss, which reduces the weight of easy negative background samples and focuses training on samples that are difficult to classify. To test the effect of Focal Loss, the authors designed RetinaNet, using ResNet and the feature pyramid structure FPN as the backbone to obtain feature maps containing multi-scale object information [17]. Two fully convolutional (FCN) subnetworks are used for classification and regression. The detection accuracy is higher, and the detection effect on small objects is better.
8) GA-RPN: To address the shortcomings of anchors, GA-RPN (Guided Anchoring), a new way to guide anchor generation through image features, was proposed. It points out that anchor design should follow two criteria: alignment and consistency [18]. The process of generating anchors is divided into two parts: location prediction and shape prediction. Location prediction generates a location score map with a subnetwork and discards most useless locations, keeping only the regions that can generate anchors. Shape prediction uses IoU as the supervision to learn the width and height of the anchor, and uses σ (scaling factor) and s (subsampling step) to compress the search range, so as to obtain the most reasonable width and height for each anchor. GA-RPN adds a feature adaption module to adapt the features to the anchor, incorporating the idea of deformable convolution; the offset field is derived from the shape prediction. The authors apply GA-RPN instead of RPN to multiple detection models, and the detection accuracy is improved to a certain extent.
9) M2Det: To make multi-scale object detection more effective, Qijie Zhao et al. created a new detector, M2Det, in 2019, which combines the multi-level feature pyramid network MLFPN with the SSD structure [19]. After feature extraction with a VGG-16 or ResNet-101 backbone and MLFPN, the SSD structure is used to obtain the bounding box locations and classification results. MLFPN consists of a feature fusion module FFM, a thin U-shaped module TUM, and a scale feature aggregation module SFAM. The low-level and high-level features extracted by the backbone are fused by FFM v1 to obtain the base feature. Each TUM generates feature maps of different scales. The base feature and the output of the previous TUM are fused through FFM v2 as the input of the next TUM. The SFAM module aggregates the features of different scales from the 8 TUMs to construct a multi-level feature pyramid, on which regression and classification are then performed. M2Det is a single-stage detector that uses multi-level and multi-scale features to detect objects at different scales, greatly improving the detection effect.

III. ANCHOR-FREE
To address the defects caused by anchors that must be set in advance, the Anchor-Free module detects objects with two kinds of methods. Keypoint-based methods describe the bounding box by a combination of keypoint locations. Dense prediction methods, based on the idea of semantic segmentation, complete detection by predicting, pixel by pixel, the object center and the distances from the point to the boundaries of the bounding box. Fig. 2 shows the branch diagram of Anchor-Free detectors and their characteristics.

Fig. 2 Branch diagram of Anchor-Free detectors

A. Keypoint-Based
With keypoint-based anchor-free models, detection no longer requires preset anchors; a bounding box is instead described by a pair of keypoints. CornerNet extracts the features of two keypoints, the top-left and bottom-right corners. It then uses heatmaps, embeddings, and offsets to complete three processes respectively: classifying the keypoints, matching them to objects, and reducing the accuracy loss caused by downsampling [20]. To address the drawback that CornerNet focuses only on edge features and ignores the information inside the region, CenterNet adds detection of the center point so that it can better characterize the object [21]. Essentially, CornerNet detects the corners of the bounding box, and these keypoints usually lie outside the object, which motivated ExtremeNet. ExtremeNet detects five points of the object: the leftmost, rightmost, topmost, and bottommost points and the center [22]. The detection idea is roughly the same as in CornerNet, but the points are grouped by exhaustively enumerating combinations and checking the center point instead of using embeddings.

B. Dense Prediction
Similarly, to avoid the complicated computation associated with anchors, FCOS detects objects by pixel-by-pixel prediction based on the idea of semantic segmentation [23]. It predicts with the multi-scale strategy of FPN and adds a center-ness branch in parallel with the classification branch. The center-ness output decreases as the distance between a location and the center of the object increases; during testing it is multiplied by the classification score to rank the boxes. As a result, low-quality boxes that are far away from the object center are suppressed, and the detection accuracy improves. Foveabox consists of FPN and two branches (a category prediction branch and a box regression branch). It uses control factors to limit the range of bounding box scales predicted on each layer of the FPN. It maps the center of each ground truth object onto the FPN and, based on this, applies the zoom factors σ1 and σ2 to select the positive and negative samples, and predicts a category for every pixel of the positive samples. In the regression branch, it
learns a transform between the predicted bounding box and the ground truth instead of predicting the coordinates of the boxes directly [24]. FSAF is similar to Foveabox in suppressing boxes that are far from the center of the object, but it differs from the above two methods in its multi-scale strategy. FSAF no longer manually assigns prediction boxes to different layers, but calculates the sum of the focal loss and the IoU loss for each anchor-free branch. It then assigns the instance to the feature layer with the smallest loss, which automates feature selection for different objects [25].

IV. EVOLUTION OF DETECTION TECHNIQUE
A. Backbone Improvement
In object detection, the backbone plays an indispensable role in extracting image features, which directly affects subsequent localization. To improve the accuracy of the detector, networks have gradually become deeper and wider, for example AlexNet (several layers), VGGNet (more than a dozen layers), and ResNet (hundreds of layers) [26][27][28]. The Inception series (V1 to V4), which began with GoogLeNet, reduces parameters while increasing the width, giving a sparse network with high computing performance [29]. Although network performance has continuously improved, the large number of parameters causes difficulties in storage space and detection speed.
To improve efficiency and reduce parameters without losing performance, lightweight models have been proposed, such as SqueezeNet, which uses a fire module to reduce the model parameters; MobileNet, which uses depthwise separable convolution and pointwise convolution instead of standard convolution to reduce the parameters and increase the speed; and ShuffleNet, which uses group convolutions and channel shuffle operations [30][31][32]. NAS automatically designs the backbone through neural architecture search. It first defines the search space and then applies a search strategy to find a suitable architecture for performance evaluation [33]. At present, it plays a very significant role in lightweight network design.

B. Location Accuracy Improvement
The two major tasks of object detection are object classification and localization, and the loss function has an essential impact on the latter. Most current detectors use the L1, L2, or smooth L1 (introduced in Fast R-CNN) loss function for bounding box regression; the formulas are shown in (1), (2) and (3). However, it is not advisable to optimize for the location accuracy metric IoU with a loss function that is not correlated with it. In addition, as shown in formula (4), the final bounding box regression loss sums the losses of the four coordinates of the bounding box, that is, the four coordinates are assumed to be independent of each other, whereas in fact they are correlated. Therefore, designing a new loss function is an important task for improving location accuracy.

$$L_1(x) = |x| \tag{1}$$

$$L_2(x) = x^2, \qquad \frac{dL_2(x)}{dx} = 2x \tag{2}$$

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{3}$$

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i) \tag{4}$$

UnitBox proposed the IoU loss, which directly uses IoU as the loss function [34]. The formulas are shown in (5) and (6). It is robust to objects of different scales, but it cannot reflect the distance between non-overlapping prediction boxes and ground truth boxes. Based on it, GIoU, a metric that remains meaningful in the non-overlapping case, was proposed, but its convergence is slow [35]. The formulas are shown in (7) and (8). DIoU not only considers the overlapping area, but also adds a penalty term that minimizes the distance between the center points of the two boxes to speed up convergence; the formula is shown in (9). However, DIoU still ignores the last of the three geometric factors of a regression loss, which are overlap area, center point distance, and aspect ratio. Therefore, a complete IoU loss, CIoU, was proposed, which further improves the detection accuracy; the formula is shown in (10) [36]. Traditional loss functions ignore the uncertainty of the bounding box. To address this, KL Loss learns the bounding box regression and the localization uncertainty simultaneously [37]. The learned variance is used during NMS through variance voting, so as to obtain a more accurate bounding box.

$$IoU = \frac{|C \cap G|}{|C \cup G|} \tag{5}$$

$$L_{IoU} = -\ln(IoU) \quad \text{or} \quad L_{IoU} = 1 - IoU \tag{6}$$

$$GIoU = IoU - \frac{|A \setminus (C \cup G)|}{|A|} \tag{7}$$

C: candidate box
G: ground truth box
A: the minimum enclosing rectangle of C and G

$$L_{GIoU} = 1 - GIoU \tag{8}$$

$$L_{DIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} \tag{9}$$

where $b$ and $b^{gt}$ are the center points of the predicted box and the ground truth, $\rho$ represents the Euclidean distance, and $c$ is the diagonal length of the minimum enclosing rectangle of the two boxes.

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{10}$$

where $\alpha$ is the trade-off parameter and $v$ is the consistency parameter of the aspect ratio.

C. Imbalance of Positive and Negative Samples
In two-stage detection, a relatively small number of object candidate boxes are generated in the first stage and then classified and regressed in the second stage, so a large number of negative samples are filtered out. However, the single-stage detectors don't have a stage that reduces the candidate boxes, making the samples too dense, among which the easy negative
samples account for most of the total. They dominate the data and are not useful for model training, so solving the imbalance of positive and negative samples is a key task of object detection.
In Faster R-CNN, the ratio of positive to negative samples is adjusted to 1:3 by heuristic sampling. Because this cannot completely solve the problem of sample imbalance, the OHEM algorithm calculates the loss for the input samples and automatically selects hard samples after sorting them by loss [38]. However, it retains only the samples with high loss and ignores easy samples entirely. To solve this problem, a new loss function, Focal Loss, was proposed. It is based on the cross-entropy function and makes the model pay more attention to hard samples by adjusting the weights of the positive and negative samples; the formula is shown in (11) [17]. A detector, RetinaNet, was designed to verify its effectiveness. However, Focal Loss has two shortcomings: its hyper-parameters are difficult to tune, and it cannot adapt to the data distribution. GHM uses the inverse of the gradient density as the weight of the classification and regression loss functions and alleviates the imbalance from the perspective of gradient information; the formula for the gradient density is shown in (12) [39]. Recently, a method was proposed that alleviates category imbalance with a ranking-based loss, AP-Loss, instead of a classification loss [40]. AP-Loss is optimized by an error-driven update mechanism because it is non-differentiable and non-convex.

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{11}$$

$$GD(g) = \frac{1}{l_{\epsilon}(g)} \sum_{k=1}^{N} \delta_{\epsilon}(g_k, g) \tag{12}$$

where $g$ is the gradient norm and $p^*$ is the sample label, which is 1 for a positive sample.

D. Optimization of NMS
In both single-stage and two-stage detectors, filtering the candidate boxes plays a vital role in reducing the computational load. Non-maximum suppression (NMS) is used in the post-processing of detection to remove redundant bounding boxes. Traditional NMS repeatedly selects the box with the highest confidence score and computes its IoU with the other boxes; a box is discarded as soon as its IoU exceeds a preset threshold [41]. Although this method has been widely used in many detectors, it automatically discards overlapping objects. The detection accuracy is not high when the objects in the scene are dense, and the IoU threshold is not easy to set.
To avoid the shortcomings of the crude NMS, its improved version, Soft-NMS, no longer discards the overlapping boxes, but uses a function to reduce their confidence scores [42]. However, the above two methods are
have high location confidence [43]. In response to the lack of location confidence, IoU-Net proposed IoU-guided NMS [44]. Instead of the classification confidence, it uses the IoU between the prediction box and the ground truth as the ranking metric in NMS, and it introduces the location confidence to improve the interpretability of the regression. To address the difficulty of setting the IoU threshold, ConvNMS, a method that uses a convolutional network for NMS, was proposed, which obtains the best output by learning [45]. To solve the problem of dense detection, Adaptive NMS designs a subnetwork to learn the density of objects in the surroundings and sets the IoU threshold based on this density, achieving good results in actual detection [46].

E. Multi-scale Detection
Object detection has always had difficulty detecting objects at multiple scales. To improve multi-scale detection, SSD detects objects of different scales in different convolution layers, and the resulting multi-scale feature maps jointly produce the final output. But this architecture is not ideal for small objects. To solve this problem, FPN proposes a structure with a top-down pathway and lateral connections. It constructs semantically strong feature maps at different scales, but increases the computation. In response to the impact of scale variation on detection, SNIP ignores oversized and undersized objects and builds image pyramids of different sizes, where each image is used only for objects in a specified scale range, and the results are finally merged by NMS, but its training speed is slow [47]. To reduce the amount of computation and make multi-scale object detection more accurate, SNIPER retains the scale invariance of R-CNN and the speed of the Fast series at the same time [48]. At each scale, chips of 512 × 512 are sampled and divided into Positive Chips and Negative Chips. After background filtering, the number of pixels to be trained on is reduced, and the speed is greatly improved. The MLFPN mentioned above obtains multi-level and multi-scale features through three parts: the feature fusion module FFM, the thin U-shaped module TUM, and the scale feature aggregation module SFAM, which improves the accuracy of multi-scale object detection.

V. DETECTORS' PERFORMANCE COMPARISON
This paper introduces the single-stage and two-stage Anchor-Based detection algorithms in detail. Both have their advantages and disadvantages. The two-stage detectors have high detection accuracy, but their detection speed is slow. The single-stage detectors have the advantage in detection speed, but their detection accuracy is lower. Therefore, balancing real-time performance and accuracy is a momentous task for object detection research. Besides, the key-point based methods and dense prediction methods of Anchor-free
judged based on the classification confidence score, and do module are summarized. Table I summarizes the performance
not take the location accuracy of the bounding box into of Anchor-based and Anchor-free detectors in Pascal
account. In other words, the class confidence and the location VOC2007 and MSCOCO datasets, including the accuracy
confidence are non-positive correlation. In order to improve mAP (mean average precision) and detection speed FPS
the location accuracy, Softer-NMS weighted averaging the (frame per second). Where "-" means no relevant data.
bounding boxes, making the boxes of high confidence scores
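As a minimal sketch of the Focal Loss in (11), the Python snippet below evaluates the loss
for a single binary prediction and compares it with plain cross entropy. The function names
and the per-sample formulation are illustrative assumptions, and the defaults alpha = 0.25
and gamma = 2 are the values commonly reported for [17], not settings taken from this
survey.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for one sample.

    p: predicted probability of the positive class, 0 < p < 1
    y: ground-truth label, 1 for positive and 0 for negative
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal Loss of (11): FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    alpha and gamma are assumed defaults; tune them for a real dataset.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
    for p in (0.9, 0.5, 0.1):
        print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With gamma = 0 and alpha_t = 1 the expression reduces to cross entropy, which is one way
to see that the modulating factor (1 - p_t)^gamma is what shifts the weight toward hard
samples.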
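The traditional NMS and Soft-NMS procedures described under Optimization of NMS can be
sketched as follows. This is an illustrative pure-Python version, not the implementation of
[41] or [42]: the box format (x1, y1, x2, y2), the function names, the Gaussian decay with
sigma = 0.5, and the score threshold 0.001 are all assumptions made for the example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Traditional NMS: keep the top-scoring box, discard boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS with a Gaussian penalty: decay scores instead of discarding boxes."""
    scores = list(scores)  # work on a copy so the caller's list is unchanged
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break  # every remaining score is below the threshold
        keep.append(best)
        for i in remaining:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
    return keep, scores


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
    print(soft_nms(boxes, scores))
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so
traditional NMS removes it, while Soft-NMS keeps it with a reduced score. That difference
is what helps in dense scenes where overlapping boxes may belong to different objects.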
TABLE I
THE PERFORMANCE COMPARISON OF DETECTORS

Module                         Detector          Backbone         VOC07 mAP/%  COCO AP/%  Speed/fps
Anchor-Based, Two-stage        R-CNN             AlexNet          58.5         -          0.02
Anchor-Based, Two-stage        SPP-Net           ZF-5             59.2         -          2
Anchor-Based, Two-stage        Fast R-CNN        VGG16            70.0         19.7       0.4
Anchor-Based, Two-stage        Faster R-CNN      VGG16            78.8         21.9       5
Anchor-Based, Two-stage        R-FCN             ResNet-101       79.5         29.9       6
Anchor-Based, Two-stage        Mask R-CNN        ResNeXt101-FPN   -            37.1       5
Anchor-Based, Single-stage     YOLO v1           GoogLeNet        63.4         -          45
Anchor-Based, Single-stage     YOLO v2           Darknet19        78.6         21.6       40
Anchor-Based, Single-stage     YOLO v3 416×416   Darknet53        -            31         35
Anchor-Based, Single-stage     YOLO v4 416×416   CSPDarknet53     -            41.2       38
Anchor-Based, Single-stage     SSD 512×512       VGG16            79.8         28.8       22
Anchor-Based, Single-stage     DSSD 500×500      ResNet-101       81.5         33.2       6
Anchor-Based, Single-stage     RetinaNet         ResNet-101-FPN   73.2         39.1       5
Anchor-Based, Single-stage     M2Det 800         VGG-16           -            41         11.8
Anchor-free, Keypoint Based    CenterNet         ResNet-101       78.7         47.0       -
Anchor-free, Keypoint Based    CornerNet         Hourglass-104    -            42.1       -
Anchor-free, Keypoint Based    ExtremeNet        Hourglass        -            40.5       4.4
Anchor-free, Dense prediction  FCOS              ResNet101-FPN    -            41.0       -
Anchor-free, Dense prediction  Foveabox          ResNet-101       -            40.6       -
Anchor-free, Dense prediction  FSAF              ResNet101-FPN    -            44.6       -

CONCLUDING REMARKS AND PROSPECT
With the in-depth study of deep learning technology in the field of object detection in
recent years, an increasing number of new theories and methods are emerging. This article
summarizes the development of object detection research based on deep learning, and
explains the network structure, implementation process, advantages, and disadvantages of
each representative detector in the two modules, Anchor-Based and Anchor-Free. Aiming at
the two properties that a detector should have in practical applications, accuracy and
real-time performance, many scholars have proposed a series of directions for optimizing
the detector, such as the optimization of the backbone, the improvement of location
accuracy, multi-scale detection, and the optimization of NMS. We summarize the development
of the key technologies of object detection from these aspects.
In addition, research on object detection technology may follow these trends in the future:

A. Lightweight detector
Although the accuracy of detectors is gradually improving, the complexity of the model
affects the detection speed. To improve efficiency with limited computing power and memory,
and to achieve more stable performance in applications such as mobile devices, smart
cameras, and autonomous driving, the development of lightweight detectors is an important
direction.

B. Combination of single-stage and two-stage detectors
The two-stage detectors include a step that generates object candidate boxes; their overall
accuracy is higher, but their detection speed is slower. The single-stage detectors are
faster but less accurate, especially for small objects and dense scenes. How to combine the
advantages of the two kinds of detectors to achieve both high accuracy and real-time
performance is an important challenge.

C. Anchor-Free module
Because anchors must be predefined, they adapt poorly to different detection tasks and
limit the transferability of the model. To avoid these problems, scholars have proposed a
method that guides anchor generation by image features, as well as anchor-free modules
based on key points and dense prediction. The flexibility of this new module is expected to
bring new ideas.

D. Weakly supervised detection
Training object detectors often requires a large number of labeled images. However, the
labeling process costs a great deal of computation and time, so an algorithm can instead be
trained on a small amount of weakly labeled data to map the input data to stronger labels.
Weakly supervised detection methods that reduce labor costs and improve detection
efficiency are topics that require in-depth research.

E. Small object detection
Small object detection has always been a problem for deep learning detection. Small objects
occupy few pixels and their features are not obvious, so high resolution is required. The
use of scaled images or multi-layer feature maps to detect small objects brings a certain
degree of improvement. Future work in this direction may focus on the development of
super-resolution lightweight networks and the more efficient use of contextual semantic
information.

F. Object detection in video
The motion, deformation, and occlusion of objects in video make detection difficult. The
single-frame method divides the video into independent frames and ignores the correlation
between adjacent frames. Therefore, making use of the temporal information and context
information of objects is the key to improving video detection.

ACKNOWLEDGMENT
This work was supported by the Research and Development Projects in Key Areas of Guangdong
Province under Grant No. 2019B090922002.

REFERENCES
[1] P. Viola and M. J. Jones, "Robust real-time face detection," in International journal of computer vision, 2004, pp. 137-154.
[2] N. Andrew and L. D. Griffin, "Multiscale histogram of oriented gradient descriptors for robust character recognition," in 2011 International Conference on Document Analysis and Recognition. IEEE, 2011, pp. 1085-1089.
[3] Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural processing letters 9.3, 1999, pp. 293-300.
[4] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1-8.
[5] R. Girshick, et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580-587.
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in European conference on computer vision. Springer, 2014, pp. 346-361.
[7] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440-1448.
[8] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in neural information processing systems, 2015, pp. 91-99.
[9] J. Dai, Y. Li, K. He, and J. Sun, "R-fcn: Object detection via region-based fully convolutional networks," in Advances in neural information processing systems, 2016, pp. 379-387.
[10] K. He, et al., "Mask r-cnn," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980-2988.
[11] J. Redmon, et al., "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779-788.
[12] J. Redmon and A. Farhadi, "Yolo9000: better, faster, stronger," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6517-6525.
[13] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[14] B. Alexey, C.-Y. Wang, and H.-Y. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv preprint arXiv:2004.10934.
[15] W. Liu, et al., "Ssd: Single shot multibox detector," in European conference on computer vision. Springer, Cham, 2016, pp. 21-37.
[16] C.-Y. Fu, et al., "Dssd: Deconvolutional single shot detector," arXiv preprint arXiv:1701.06659, 2017.
[17] T.-Y. Lin, et al., "Focal loss for dense object detection," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 318-327.
[18] J.-Q. Wang, et al., "Region proposal by guided anchoring," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2965-2974.
[19] Q.-J. Zhao, et al., "M2det: A single-shot object detector based on multi-level feature pyramid network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9259-9266.
[20] H. Law and J. Deng, "Cornernet: Detecting objects as paired keypoints," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 765-781.
[21] K.-W. Duan, et al., "Centernet: Keypoint triplets for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6568-6577.
[22] X.-Y. Zhou, J.-C. Zhuo, and P. Krahenbuhl, "Bottom-up object detection by grouping extreme and center points," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 850-859.
[23] Z. Tian, et al., "Fcos: Fully convolutional one-stage object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9626-9635.
[24] T. Kong, et al., "Foveabox: Beyond anchor-based object detector," arXiv preprint arXiv:1904.03797, 2019.
[25] C. Zhu, Y. He, and M. Savvides, "Feature selective anchor-free module for single-shot object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840-849.
[26] K. Alex, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in neural information processing systems, 2012, pp. 1106-1114.
[27] S. Karen and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[28] K.-M. He, et al., "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[29] C. Szegedy, et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[30] F. N. Iandola, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[31] A. G. Howard, et al., "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[32] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6848-6856.
[33] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[34] J. Yu, et al., "Unitbox: An advanced object detection network," in Proceedings of the 24th ACM international conference on Multimedia. ACM, 2016, pp. 516-520.
[35] H. Rezatofighi, et al., "Generalized intersection over union: A metric and a loss for bounding box regression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 658-666.
[36] Z.-H. Zheng, et al., "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression," arXiv preprint arXiv:1911.08287, 2019.
[37] Y.-H. He, et al., "Bounding box regression with uncertainty for accurate object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2888-2897.
[38] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 761-769.
[39] B.-Y. Li, Y. Liu, and X.-G. Wang, "Gradient harmonized single-stage detector," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 8577-8584.
[40] K. Chen, et al., "Towards accurate one-stage object detection with AP-loss," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5119-5127.
[41] A. Neubeck and L. Van Gool, "Efficient non-maximum suppression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 850-855.
[42] N. Bodla, et al., "Soft-NMS--improving object detection with one line of code," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5562-5570.
[43] Y.-H. He, et al., "Softer-nms: Rethinking bounding box regression for accurate object detection," arXiv preprint arXiv:1809.08545, 2018.
[44] B. Jiang, et al., "Acquisition of localization confidence for accurate object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 816-832.
[45] J. Hosang, B. Rodrigo, and S. Bernt, "A convnet for non-maximum suppression," in German Conference on Pattern Recognition. Springer, Cham, 2016, pp. 192-204.
[46] S.-T. Liu, H. Di, and Y.-H. Wang, "Adaptive nms: Refining pedestrian detection in a crowd," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6459-6468.
[47] B. Singh and L. S. Davis, "An analysis of scale invariance in object detection snip," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3578-3587.
[48] B. Singh, M. Najibi, and L. S. Davis, "SNIPER: Efficient multi-scale training," in Advances in neural information processing systems, 2018, pp. 9333-9343.
[49] Z.-X. Zou, Z.-W. Shi, Y.-H. Guo, and J.-P. Ye, "Object Detection in 20 Years: A Survey," arXiv preprint arXiv:1905.05055, 2019.
[50] L. Jiao, et al., "A Survey of Deep Learning-Based Object Detection," IEEE Access, 2019, pp. 128837-128868.
[51] X.-W. Wu, D. Sahoo, and S. C. H. Hoi, "Recent Advances in Deep Learning for Object Detection," arXiv preprint arXiv:1908.03673, 2019.
[52] K. Oksuz, et al., "Imbalance Problems in Object Detection: A Review," arXiv preprint arXiv:1909.00169, 2019.