arXiv:2208.02019v2 [cs.CV] 4 Aug 2022

YOLO-FaceV2: A Scale and Occlusion Aware Face Detector
Ziping Yu1, Hongbo Huang∗2, Weijun Chen3, Yongxin Su4, Yahui Liu5, and Xiuying Wang2
1 School of Instrument Science and Opto-electronic Engineering, Beijing Information Science and Technology University
∗ Corresponding Author: hhb@bistu.edu.cn
Abstract
In recent years, face detection algorithms based on deep learning have made
great progress. These algorithms can be generally divided into two categories,
i.e., two-stage detectors like Faster R-CNN and one-stage detectors like YOLO. Because
of their better balance between accuracy and speed, one-stage detectors have been widely
used in many applications. In this paper, we propose a real-time face detector based on
the one-stage detector YOLOv5, named YOLO-FaceV2. We design a Receptive Field
Enhancement module called RFE to enhance the receptive field of small faces, and use
NWD Loss to compensate for the sensitivity of IoU to the location deviation of tiny objects.
To handle face occlusion, we present an attention module named SEAM and introduce
Repulsion Loss. Moreover, we use a weight function, Slide, to address the imbalance
between easy and hard samples, and use the information of the effective receptive field
to design the anchors. The experimental results on the WiderFace dataset show that our
face detector outperforms YOLO and its variants on all of the easy, medium and hard
subsets. The source code is available at https://github.com/Krasjet-Yu/YOLO-FaceV2
1 Introduction
Face detection is an essential step in many face-related applications, such as face recognition,
face verification, and face attribute analysis. With the rapid development of deep convolutional
neural networks in recent years, the performance of face detectors has been greatly improved.
Many high-performance face detection algorithms based on deep learning have been proposed.
Generally, these algorithms can be divided into two branches. One typical branch of deep-
learning-based face detection algorithms [1, 2, 3] uses cascaded neural networks as
feature extractors and classifiers to detect faces from coarse to fine. Despite their great success,
cascade detectors suffer from drawbacks such as difficult training and slow detection speed.
The other branch is derived from general-purpose object detection algorithms [4, 5, 6].
General-purpose object detectors take into account the more common features and broader
characteristics of objects. Task-specific detectors can therefore share this information and then
enforce their particular properties through special designs. Some popular face detectors,
including YOLO [7, 8, 9, 10], Faster R-CNN [5] and RetinaNet [6], fall into this category.
In this paper, inspired by YOLOv5 [11], TridentNet [12] and the attention network in FAN [13],
we propose a novel face detector that achieves state-of-the-art performance in one-stage face
detection.
Although deep convolutional networks have improved face detection remarkably, detecting
faces with high variance in scale, pose, occlusion, expression, appearance, and illumination
in realistic scenes remains a great challenge. In our previous work, we proposed YOLO-
Face [14], an improved face detector based on YOLOv3 [9], which mainly focused on the
problem of scale variance, designed anchor ratios suitable for human faces, and utilized a more
accurate regression loss function. Its mAP on the Easy, Medium, and Hard subsets of the
WiderFace [15] validation set reached 0.899, 0.872, and 0.693, respectively. Since then, a
variety of new detectors have been presented and face detection performance has been
significantly improved. However, for small objects, one-stage detectors have to divide the
search space with a finer granularity, which tends to cause an imbalance of positive and
negative samples [16]. Furthermore, face occlusion [13] in complex scenes affects the accuracy
of face detectors remarkably. Aiming to address the problems of varying face scales, easy/hard
sample imbalance and face occlusion, we propose a YOLOv5-based face detection method
called YOLO-FaceV2.
By carefully analyzing the difficulties encountered by face detectors and the shortcomings
of the YOLOv5 detector, we propose the following solutions.
Multi-scale fusion: In many scenarios, faces of very different scales coexist in an image,
and it is difficult for a face detector to detect them all. Handling faces of varying scales is
therefore a very important task for face detection algorithms. Currently, the main approach
is to construct a pyramid that fuses the multi-scale features of faces [17, 18, 19, 20]. For
example, in YOLOv5, the FPN [20] fuses the features of the P3, P4 and P5 layers. However,
for small-scale objects, information is easily lost after multiple convolution layers, and very
little pixel information is retained, even in the shallower P3 layer. Therefore, increasing the
resolution of the feature map can undoubtedly benefit the detection of small objects, as the
footprint calculation below illustrates.
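To make the resolution argument concrete, a small back-of-the-envelope check: at FPN level Pk the stride is 2^k, so a face of s×s pixels occupies roughly (s/2^k)×(s/2^k) cells of the feature map. The snippet below (an illustrative calculation; the 16×16 face size is chosen arbitrarily) shows how quickly the footprint vanishes at deeper levels:

```python
# Feature-map footprint of a 16x16-pixel face at different FPN levels.
for level in (2, 3, 4, 5):
    stride = 2 ** level                     # P_k has stride 2^k
    cells = 16 / stride
    print(f"P{level}: stride {stride:2d} -> {cells:.2f} x {cells:.2f} cells")
# P2 keeps a 4x4 footprint; at P5 the face covers half a cell and is
# effectively invisible, which is why fusing P2 helps small faces.
```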
Attention mechanism: In many complex scenes, face occlusion often occurs, and it is
one of the main reasons for the accuracy decline of face detectors. To address this problem,
some researchers apply attention mechanisms to facial feature extraction. FAN [13] proposes
an anchor-level attention: the idea is to maintain the response values of the unoccluded
regions and to compensate, through the attention mechanism, for the reduced response values
of the occluded regions. However, it does not fully utilize the information between channels.
Hard samples: In one-stage detectors, many bounding boxes are not filtered out
iteratively, so the number of easy samples is very large. During training, their cumulative
contribution dominates the update of the model, leading to overfitting [16]. This is known as
the sample imbalance problem. To deal with it, Lin et al. propose Focal Loss, which
dynamically assigns larger weights to difficult examples [6]. Similarly, the Gradient
Harmonizing Mechanism (GHM) [21] suppresses the gradients from simple positive and
negative samples so as to focus more on difficult samples. Prime Sample Attention (PISA) [22],
proposed by Cao et al., assigns weights to positive and negative samples according to different
criteria. However, current hard sample mining methods require too many hyperparameters to
be set, which is inconvenient in practice.
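As a concrete illustration of this re-weighting idea, below is a minimal sketch of the Focal Loss modulating factor [6] (an illustrative reimplementation with the standard `alpha`/`gamma` hyperparameters, not the loss used in our detector):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy samples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```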
Anchor design: As pointed out in [23], a region in a CNN feature map has two types
of receptive fields: the theoretical receptive field and the effective receptive field. Experiments
show that not all pixels in the receptive field respond equally; instead, the responses obey a
Gaussian distribution. This makes anchor sizes based on the theoretical receptive field larger
than its actual size, which makes the regression of bounding boxes more difficult. Zhang et
al. design the anchor sizes based on the effective receptive field in S³FD [24], and
FaceBoxes [25] designs multi-scale anchors to enrich the receptive fields and discretizes
anchors over different layers to handle faces of various scales. Therefore, the design of the
scales and ratios of the anchor boxes is very important and may greatly benefit the accuracy
and convergence of the model.
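The Gaussian shape of the effective receptive field is easy to verify empirically: in a linear stack of convolutions, the gradient of a central output unit with respect to the input concentrates in a blob far smaller than the theoretical receptive field. A quick check (our toy setup; the depth, the uniform weights and the 10% threshold are arbitrary illustration choices):

```python
import torch
import torch.nn as nn

# Ten stacked 3x3 convolutions: theoretical RF is 1 + 10*2 = 21 pixels wide.
net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1, bias=False) for _ in range(10)])
for m in net:
    nn.init.constant_(m.weight, 1.0 / 9)        # uniform positive weights

x = torch.zeros(1, 1, 64, 64, requires_grad=True)
net(x)[0, 0, 32, 32].backward()                 # gradient of the center output
g = x.grad[0, 0]
erf_cells = (g > 0.1 * g.max()).sum().item()    # input cells above 10% of peak
print(f"TRF area: {21 * 21}, ERF area (10% threshold): {erf_cells}")
```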
Regression loss: The regression loss measures the difference between the predicted
bounding box and the ground-truth bounding box. The regression loss functions commonly
used in object detectors are the L1/L2 loss, smooth L1 loss, IoU loss and its variants
[26, 27, 28, 29]. YOLOv5 takes the IoU loss as its regression objective. However, the
sensitivity of IoU varies greatly for objects of different scales: for small targets, a slight
position deviation leads to a significant IoU drop. Wang et al. [30] propose a small-target
evaluation metric based on the Wasserstein distance that effectively mitigates this sensitivity,
although their method brings little benefit for large targets.
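The scale sensitivity of IoU is easy to see numerically: the same pixel shift that barely changes the IoU of a large box collapses that of a tiny one. A self-contained check (box sizes chosen arbitrarily for illustration):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# A 4-pixel horizontal shift:
print(iou((0, 0, 100, 100), (4, 0, 104, 100)))  # large box: ~0.923
print(iou((0, 0, 10, 10), (4, 0, 14, 10)))      # tiny box:  ~0.429
```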
In this paper, to address the aforementioned problems, we design a new face detector based
on YOLOv5. Our aim is to find an optimal combination that effectively solves the problems
of small faces, large scale variations, occluded scenes and imbalanced hard and easy samples.
First, we fuse the P2 layer of the FPN to obtain more pixel-level information and compensate
for the information loss of small faces. However, the detection accuracy for large and medium
targets is then slightly reduced because the receptive field of the output feature map becomes
smaller. To ameliorate this, we design the Receptive Field Enhancement (RFE) module for
the P5 layer, which enlarges the receptive field by using dilated convolutions. Second,
inspired by FAN and ConvMixer [31], we redesign a multi-head attention network to
compensate for the loss of occluded face response values. In addition, we introduce Repulsion
Loss [32] to improve the recall of intra-class occlusions. Third, to mine hard samples, inspired
by ATSS [33], we design the Slide weight function with adaptive thresholding to make the
model focus more on hard samples during training. Fourth, to make the anchors more suitable
for regression, we redesign the anchor sizes and aspect ratios according to the effective
receptive field and the proportions of the face. Fifth, we borrow the Normalized Wasserstein
Distance metric [30] and introduce it into the regression loss function to make up for the
shortcomings of IoU in predicting small faces.
In summary, we propose a new face detector, YOLO-FaceV2, whose main contributions
are as follows.
1. For detecting multi-scale faces, the receptive field and resolution are key factors.
Therefore, we design a receptive field enhancement module (called RFE) to learn different
receptive fields of the feature map and enhance the feature pyramid representation.
2. We classify face occlusions into two categories: occlusion between different faces, and
occlusion of faces by other objects. The former makes the detection accuracy very sensitive
to the NMS threshold, which leads to missed detections. We apply Repulsion Loss to face
detection: it penalizes a predicted box for shifting towards other ground-truth objects and
requires each predicted box to keep away from other predicted boxes with different designated
targets, making the detection results less sensitive to NMS. The latter causes feature
disappearance and inaccurate localization, and we design the attention module SEAM to
enhance the learning of face features.
3. To address the imbalance between hard and easy samples, we weight the easy and hard
samples according to their IoU. To reduce hyperparameter tuning, we set the mean IoU of
all candidate positive samples with the ground truth as the dividing line between positive
and negative samples, and we design a weighting function named Slide that gives higher
weight to hard samples, which helps the model learn more difficult features. The details of
this function are presented in Section 3.
The rest of the paper is arranged as follows: in Section 2 we review the related literature;
in Section 3 we describe the model structure in detail, together with the main improvements,
including the receptive field enhancement module, the attention module, the adaptive sample
weighting function, the anchor design, the Repulsion Loss and the Normalized Gaussian
Wasserstein Distance (NWD) Loss; in Section 4 we describe the experiments and the
corresponding analysis of the results, including ablation experiments and comparisons with
other models; and in Section 5 we summarize our work and give some advice on future
research.
2 Related Works
Face Detection. Face detection has been a hot research area in computer vision for decades.
In the early years of deep learning, face detection algorithms usually used neural networks to
automatically extract image features for classification. CascadeCNN [1] proposes a cascaded
structure with three stages of carefully designed deep convolutional networks that predict
face and landmark locations in a coarse-to-fine manner. MTCNN [2] develops a similar cascade
architecture to jointly align face landmarks and detect face locations. PCN [3] uses an
angle prediction network to correct faces and improve face detection accuracy. However, early
deep-learning-based face detection algorithms have drawbacks such as tedious training,
local optima, slow detection speed and low detection accuracy.
Current face detection algorithms mainly improve upon and inherit the advantages of generic
object detection algorithms such as SSD [4], Faster R-CNN [5] and RetinaNet [6]. CMS-
RCNN [34] uses Faster R-CNN as its backbone and introduces contextual information and
multi-scale features to detect faces. Zhang et al. [25] design a lightweight network based on
the SSD structure, named FaceBoxes, which rapidly shrinks the feature size by 32×
down-sampling and uses a multi-scale network module to enhance the features in both the
width and depth dimensions of the network. SRN [35], which improves on the generic object
detection algorithms RefineDet [36] and RetinaNet [6], achieves high performance by
introducing two-step classification and regression, and designs a multi-branch module to
enhance the effect of receptive fields.
Scale-invariance. As one of the most challenging problems in face detection, large face scale
variation in complex scenes has an important impact on the accuracy of the detector. Multi-
scale detection capability mainly depends on scale-invariant features, and many works address
this problem by extracting features more accurately and effectively [13, 24, 37, 38]. For small
object detection, using fewer down-sampling layers and dilated convolutions can significantly
improve the detection performance [39, 40]. Another way to address this problem is to use
more anchors. Anchors provide good prior information, so using denser anchors and
corresponding matching strategies can effectively improve the quality of object proposals
[24, 25, 37, 40]. Multi-scale training helps construct image pyramids and increases sample
diversity, a simple but effective method to improve the performance of multi-scale object
detection. On the other hand, as the receptive fields increase, the semantic information gets
richer accordingly, but spatial information may be lost. A natural idea is to fuse deep semantic
information with shallow features, as in [20, 41, 42]. Besides, SNIP [43] and TridentNet [12]
also provide new ideas for solving the multi-scale problem, which will be discussed in detail
in the following sections.
Occlusion problem. Crowded faces and the resulting occlusion lead to partial observations
and a lack of information about the occluded faces, because some regions are invisible or
their boundaries are blurred, which easily causes missed detections and low recall. Some
works have demonstrated that contextual information helps face detection alleviate the
occlusion problem. SSH [37] uses simple convolution layers to incorporate context by
enlarging the window around the candidate proposals. FAN [13] proposes an anchor-level
attention to detect occluded faces by highlighting the features from the face region. Pyramid-
Box [44] designs a context-sensitive prediction module that replaces the convolution layers
of the SSH context module with the residual prediction module of DSSD. RetinaFace [45]
applies independent context modules on five feature pyramid levels to increase the receptive
field and enhance the rigid context modelling power. The above methods have achieved good
results on the occlusion problem. Therefore, using context information to improve the
effectiveness of occluded regions is a feasible direction worth further exploration.
Imbalance of easy and hard samples. For one-stage face detection, the number of easy
samples is very large, and they dominate the variation of the loss, so the model learns only
the features of easy samples and ignores the hard ones. To address this problem, the
OHEM [46] algorithm selects difficult samples according to the sample loss and applies the
loss of the difficult samples to training with stochastic gradient descent. In response to
OHEM's neglect of easy samples, Focal Loss [6] makes better use of all samples by weighting
them and obtains higher accuracy; this idea is also followed by SRN [35]. FaceBoxes [25]
sorts the samples according to their IoU loss and controls the ratio of positive to negative
samples to be at most 1:3. Although the above methods can effectively alleviate the sample
imbalance problem, they artificially introduce hyperparameters, which increases the difficulty
of tuning. Therefore, we design a sample balancing function with adaptive parameters.
3 YOLO-FaceV2
3.1 Network Architecture
YOLOv5 is an excellent general object detector. We introduce YOLOv5 into the face detection
field and address the problems of small faces, face occlusion, etc.
The architecture of our YOLO-FaceV2 detector is shown in Figure 1. It consists of three
parts: the backbone, the neck and the heads. We take CSPDarknet53 as our backbone and
replace the Bottleneck with the RFE module in the P5 layer to fuse multi-scale features. In
the neck, we maintain the structure of SPP [47] and PAN [48]. In addition, to improve the
ability of target position perception, we also integrate the P2 layer into the PAN. The heads
classify the category and regress the location of the target. We also add a special branch to
the heads to enhance the model's ability to detect occlusions.
In Figure 1 (a), the red part on the left is the backbone of the detector, which is composed
of CSP blocks and CBS blocks and is mainly used to extract features from the input images.
The RFE module is added to expand the effective receptive field and enhance the multi-scale
fusion capability in the P5 layer. In Figure 1 (b), the blue and yellow parts on the right are
the neck layers, which consist of SPP and PAN. We additionally fuse the features of the P2
layer to enable more accurate target localization. In Figure 1 (c), we introduce the Separated
and Enhancement Attention Module (SEAM) after the output of the neck layer to strengthen
the responsiveness to occluded faces.
Figure 1: Network architecture of YOLO-FaceV2. (a) Backbone: a feed-forward CSP-
Darknet53 architecture extracts the multi-scale feature maps. To expand the receptive
field, the CSP block in P5 is replaced by the receptive field enhancement module (RFE),
shown in the blue dotted box. (b) Neck: a Spatial Pyramid Pooling (SPP) block
separates out the most significant context features and increases the receptive field,
and a Path Aggregation Network (PAN) aggregates parameters from different backbone
levels for different detector levels. To compensate for the loss of resolution caused by
the increased receptive field, the P2 layer is fused into the PAN, shown between (a)
and (b). (c) A Separated and Enhancement Attention Module (SEAM) uses the
relationship between feature maps to recall occluded features, shown in the red dotted box.
Figure 2: Modified CSP block and RFE module. For the CSP block in P5, we replace
the Bottleneck with the RFE. The right figure shows the detailed architecture of the
RFE: it consists of 1×1 convolutions, 3×3 convolutions with different dilation rates,
and an average pooling layer.
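Based on this description, one plausible PyTorch sketch of such a multi-branch block is given below. The branch count, dilation rates, normalization/activation choices and the residual sum are our assumptions for illustration (the average pooling shown in Figure 2 is omitted for brevity); the exact configuration follows the figure:

```python
import torch
import torch.nn as nn

class RFE(nn.Module):
    """Receptive Field Enhancement sketch: parallel 3x3 branches with
    different dilation rates, fused by a 1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        out = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return out + x  # residual connection (our assumption)
```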
where

$$\mathrm{Smooth}_{\ln}(x)=\begin{cases}-\ln(1-x), & x\le\sigma\\[2pt] \dfrac{x-\sigma}{1-\sigma}-\ln(1-\sigma), & x>\sigma\end{cases} \tag{2}$$

Here $P$ is the face prediction box and $G^{P}_{Rep}$ is the ground-truth box with the largest
IoU around the face. The overlap between $P$ and $G^{P}_{Rep}$ is defined as intersection over
ground truth (IoG): $\mathrm{IoG}(P,G)=\frac{\mathrm{area}(P\cap G)}{\mathrm{area}(G)}$, with
$\mathrm{IoG}(P,G)\in[0,1]$. $\mathrm{Smooth}_{\ln}$ is continuously differentiable on $(0,1)$,
and $\sigma\in[0,1)$ is a smoothing parameter that adjusts the sensitivity of the repulsion
loss to outliers.
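Eqn. 2 and the IoG overlap translate directly into code; the sketch below is ours (the default σ = 0.5 is an arbitrary illustration value):

```python
import math
import torch

def smooth_ln(x, sigma=0.5):
    """Eqn. 2: -ln(1-x) for x <= sigma, linear continuation for x > sigma."""
    return torch.where(
        x <= sigma,
        -torch.log(1 - x),
        (x - sigma) / (1 - sigma) - math.log(1 - sigma),
    )

def iog(p, g):
    """Intersection over ground truth for boxes given as (x1, y1, x2, y2)."""
    iw = (torch.min(p[..., 2], g[..., 2]) - torch.max(p[..., 0], g[..., 0])).clamp(min=0)
    ih = (torch.min(p[..., 3], g[..., 3]) - torch.max(p[..., 1], g[..., 1])).clamp(min=0)
    g_area = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return iw * ih / g_area
```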
The purpose of the RepBox loss is to keep a prediction box as far away from the surrounding
prediction boxes as possible and to reduce the IoU between them, so that a prediction box
belonging to one of two overlapping faces is not suppressed by NMS. We divide the prediction
boxes into multiple groups. Assuming there are $g$ individual faces, the division is shown in
Eqn. 3: prediction boxes within the same group regress to the same face label, and prediction
boxes in different groups correspond to different face labels.

$$\mathcal{P}_{+}=\mathcal{P}_{1}\cup\mathcal{P}_{2}\cup\cdots\cup\mathcal{P}_{|g|} \tag{3}$$
Then, for prediction boxes $p_i$ and $p_j$ in different groups, we want the overlap area
between $B^{p_i}$ and $B^{p_j}$ to be as small as possible. RepBox also uses
$\mathrm{Smooth}_{\ln}$ as the optimization function. The overall loss function is as follows:

$$L_{RepBox}=\frac{\sum_{i\ne j}\mathrm{Smooth}_{\ln}\!\left(\mathrm{IoU}(B^{p_i},B^{p_j})\right)}{\sum_{i\ne j}\mathbb{1}\!\left[\mathrm{IoU}(B^{p_i},B^{p_j})>0\right]+\epsilon} \tag{4}$$
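Under the grouping of Eqn. 3, Eqn. 4 can be sketched as follows (`pairwise_iou` is a hypothetical helper returning the full (n, n) IoU matrix, and `eps` guarding the empty-overlap case is our assumption):

```python
def repbox_loss(boxes, group_ids, sigma=0.5, eps=1e-6):
    """Sketch of Eqn. 4: penalize IoU between prediction boxes assigned
    to different faces. pairwise_iou is an assumed helper."""
    ious = pairwise_iou(boxes, boxes)                       # (n, n) IoU matrix
    different_face = group_ids.view(-1, 1) != group_ids.view(1, -1)
    mask = different_face & (ious > 0)
    if not mask.any():
        return boxes.new_zeros(())
    return smooth_ln(ious[mask], sigma).sum() / (mask.sum() + eps)
```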
Figure 3: Illustration of SEAM. The left part is the architecture of SEAM; the right
part is the structure of the CSMM (Channel and Spatial Mixing Module). The CSMM
utilizes patches of different sizes for multi-scale features and uses depthwise separable
convolutions to learn the correlation of spatial dimensions and channels.
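Following this caption (patch embedding plus depthwise separable convolution, in the spirit of ConvMixer [31]), one possible sketch of a single CSMM branch is shown below; the patch size, channel width and activation choices are our assumptions, and the full SEAM combines several such branches:

```python
import torch.nn as nn

class CSMMBranch(nn.Module):
    """One CSMM branch sketch: patch embedding, then depthwise convolution
    (spatial mixing) followed by pointwise convolution (channel mixing)."""
    def __init__(self, in_ch, dim, patch):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        self.channel = nn.Sequential(
            nn.Conv2d(dim, dim, 1),                         # pointwise
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = self.embed(x)
        x = x + self.spatial(x)   # residual spatial mixing
        return self.channel(x)
```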
Figure 4: We propose a novel loss, termed the Slide Loss, which adaptively learns the
positive/negative sample threshold parameter µ. Setting high weights near µ increases
the relative loss for hard-classified examples, putting more focus on hard, misclassified
examples.
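A weighting function with the shape described in the caption can be sketched as follows. The piecewise form below is one plausible realization (unit weight for easy negatives, a boosted constant just below µ, exponential decay above it), not necessarily the exact published form:

```python
import math
import torch

def slide_weight(iou, mu):
    """Sketch of a Slide-style weight: emphasize samples near the threshold mu."""
    w = torch.ones_like(iou)                  # easy negatives keep weight 1
    near = (iou > mu - 0.1) & (iou < mu)
    w[near] = math.exp(1 - mu)                # boosted just below the threshold
    high = iou >= mu
    w[high] = torch.exp(1 - iou[high])        # decays as the sample gets easier
    return w
```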
(see Figure 5 (b)). Therefore, we redesigned the initial anchor sizes as shown in Table 1.
Figure 5: (a) Effective receptive field: the whole black box is the theoretical receptive
field (TRF) and the white circle with a Gaussian distribution is the effective receptive
field (ERF). The figure is from [23]. (b) A special example: the whole box is the
original anchor box set by the TRF, and the blue box is the new anchor estimated from
the red circle, which is the ERF.
object detection.

$$NWD(\mathcal{N}_a,\mathcal{N}_b)=\exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a,\mathcal{N}_b)}}{C}\right) \tag{6}$$

$$W_2^2(\mathcal{N}_a,\mathcal{N}_b)=\left\|\left[cx_a,\, cy_a,\, \tfrac{w_a}{2},\, \tfrac{h_a}{2}\right]^{\mathrm T}-\left[cx_b,\, cy_b,\, \tfrac{w_b}{2},\, \tfrac{h_b}{2}\right]^{\mathrm T}\right\|_2^2 \tag{7}$$

where $C$ is a constant closely related to the dataset, $W_2^2(\mathcal{N}_a,\mathcal{N}_b)$
is a distance measure, and $\mathcal{N}_a$ and $\mathcal{N}_b$ are Gaussian distributions
modeled by $A=(cx_a, cy_a, w_a, h_a)$ and $B=(cx_b, cy_b, w_b, h_b)$.
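Eqns. 6 and 7 translate directly into code; a minimal sketch (ours), with C left as a dataset-dependent constant whose default below is only a placeholder:

```python
import torch

def nwd(boxes_a, boxes_b, C=10.0):
    """Normalized Wasserstein Distance between boxes given as (cx, cy, w, h).
    C is dataset-dependent; the default here is a placeholder."""
    pa = torch.stack([boxes_a[..., 0], boxes_a[..., 1],
                      boxes_a[..., 2] / 2, boxes_a[..., 3] / 2], dim=-1)
    pb = torch.stack([boxes_b[..., 0], boxes_b[..., 1],
                      boxes_b[..., 2] / 2, boxes_b[..., 3] / 2], dim=-1)
    w2 = ((pa - pb) ** 2).sum(dim=-1)        # Eqn. 7: squared L2 distance
    return torch.exp(-torch.sqrt(w2) / C)    # Eqn. 6
```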
4 Experiments
In this part, we conduct comprehensive ablations of our proposed method, including the
effectiveness of our attention module, multi-scale fusion pyramid structure and loss function
design. Then, we compare the performance of our proposed detector with that of other
state-of-the-art (SOTA) face detectors.
4.1 Dataset
We evaluate our model on the WiderFace dataset, which contains 32,203 images with more
than 400k faces. It consists of three parts: 40% for the training set, 10% for the validation
set and 50% for the test set. The results on the training and validation sets can be obtained
from the official WiderFace website. According to the difficulty of detection, the dataset is
divided into three subsets: Easy, Medium and Hard. Among them, the Hard subset is the
most challenging, and performance on it better reflects the effectiveness of a face detector.
We train our model on the WiderFace training set and evaluate it on the validation and test sets.
4.2 Training
We use YOLOv5 as our baseline, and the methods are implemented in PyTorch. The optimizer
is SGD with momentum. The initial learning rate is set to 1e-2, the final learning rate
is 1e-3, and the weight decay is set to 5e-3. A momentum of 0.8 is used in the first 3 warm-up
epochs; after that, the momentum is 0.937. The IoU threshold for NMS is set to 0.5. We train
the model on a 1080Ti GPU with 4 CPU workers. Fine-tuning runs for 100 iterations with a
batch size of 16 images.
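For reference, the schedule above collected into a single configuration (a sketch mirroring the stated values; the field names are ours, not YOLOv5's):

```python
train_cfg = dict(
    optimizer="SGD",
    lr_init=1e-2,           # initial learning rate
    lr_final=1e-3,          # final learning rate
    weight_decay=5e-3,
    warmup_epochs=3,
    warmup_momentum=0.8,    # momentum during warm-up
    momentum=0.937,         # momentum after warm-up
    nms_iou=0.5,
    batch_size=16,
    iterations=100,         # fine-tuning length as stated above
    workers=4,
)
```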
Table 2: Ablation study results on the WiderFace validation dataset. Columns: SEAM,
PAN+P2, RFE, Slide, Anchor, RPLoss, NWD Loss, Easy, Medium, Hard, Params (M),
Flops (G); a '√' marks the modules enabled in each row.
√ 94.65 93.00 83.30 7.063 16.4
√ 95.53 93.82 84.36 7.464 17.1
√ √ 93.67 92.14 83.87 6.101 17.1
√ 95.06 93.60 85.47 5.097 17.1
√ 95.13 93.41 83.67 - 17.1
√ 94.89 93.75 84.20 - -
√ 94.62 92.87 83.31 - -
√ √ √ 95.27 93.63 83.80 - -
√ √ √ √ 95.06 93.64 85.57 5.201 17.9
√ √ √ √ √ 95.34 93.85 85.66
√ √ √ √ √ √ 96.22 94.79 85.82
√ √ √ √ √ √ √ 96.30 94.99 85.94
98.78 97.39 87.75 18.2
Table 3: Comparison of our YOLO-FaceV2 and existing face detector on the WiderFace
validation dataset.
Table 4: Comparison of our YOLO-FaceV2 and existing face detector on the WiderFace
validation dataset.
Figure 6: Detection results (precision-recall curves) on the WiderFace validation dataset.
(a) results on the 'Easy' subset; (b) results on the 'Medium' subset; (c) results on the
'Hard' subset.
We mainly compare with various excellent face detectors presented recently. Table 4 is
organized according to the general detectors on which the face detectors are based, such as
Faster R-CNN, SSD and YOLO. The data in the table are obtained from the official WiderFace
website. The precision-recall (PR) curves of our YOLO-FaceV2 face detector, along with
those of the competitors, are shown in Figure 6.
5 Conclusion
In this paper, aiming to address the problems of varying face scales, easy/hard sample
imbalance and face occlusion, we proposed a YOLOv5-based face detection method called
YOLO-FaceV2. For the problem of varying face scales, we fuse the P2 layer into the feature
pyramid to improve the resolution for small objects, design the RFE module to enhance the
receptive field, and use NWD Loss to improve the robustness of our model for small target
detection. We introduce the Slide function to alleviate the imbalance between easy and hard
samples. For face occlusion, we use the SEAM module and Repulsion Loss. Besides, we use
the information of the effective receptive field to design the anchors. Finally, we achieve
performance close to or exceeding the state of the art on the WiderFace validation Easy and
Medium subsets.
References
[1] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional
neural network cascade for face detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015.
[2] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using
multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–
1503, 2016.
[3] X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen. Real-time rotation-invariant face detection
with progressive calibration networks. In 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. SSD:
Single shot multibox detector. In European Conference on Computer Vision (ECCV), 2016.
[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time
object detection with region proposal networks. IEEE Transactions on Pattern Analysis
& Machine Intelligence, 39(6):1137–1149, 2017.
[6] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P Dollár. Focal loss for dense object detection.
IEEE Transactions on Pattern Analysis & Machine Intelligence, PP(99):2999–3007, 2017.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-
time object detection. IEEE, 2016.
[8] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In IEEE Conference on
Computer Vision & Pattern Recognition, pages 6517–6525, 2017.
[9] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv e-prints, 2018.
[10] A. Bochkovskiy, C. Y. Wang, and H. Y. M. Liao. Yolov4: Optimal speed and accuracy of
object detection. arXiv preprint arXiv:2004.10934, 2020.
[11] Glenn Jocher. Yolov5. https://github.com/ultralytics/yolov5.
[12] Y. Li, Y. Chen, N. Wang, and Z. Zhang. Scale-aware trident networks for object detection.
IEEE, 2019.
[13] J. Wang, Y. Yuan, and G. Yu. Face attention network: An effective face detector for the
occluded faces. 2017.
[14] Weijun Chen, Hongbo Huang, Shuai Peng, Changsheng Zhou, and Cuiping Zhang. Yolo-
face: a real-time face detector. The Visual Computer, 37(4):805–813, 2021.
[15] S. Yang, P. Luo, C. C. Loy, and X. Tang. Wider face: A face detection benchmark. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5525–5533, 2016.
[16] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas. Imbalance problems in object detection:
A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1,
2020.
[17] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance seg-
mentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2018.
[18] M. Tan, R. Pang, and Q. V. Le. Efficientdet: Scalable and efficient object detection.
In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020.
[19] S. Qiao, L. C. Chen, and A. Yuille. Detectors: Detecting objects with recursive feature
pyramid and switchable atrous convolution. arXiv, 2020.
[20] T. Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid
networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017.
[21] Min Chen, Xuemei Ren, and Zhanyi Yan. Real-time indoor object detection based on
deep learning and gradient harmonizing mechanism. In 2020 IEEE 9th Data Driven
Control and Learning Systems Conference (DDCLS), 2020.
[22] Y. Cao, K. Chen, C. C. Loy, and D. Lin. Prime sample attention in object detection.
2019.
[23] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in
deep convolutional neural networks. 2017.
[24] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S³FD: Single shot scale-invariant
face detector. In IEEE International Conference on Computer Vision (ICCV), 2017.
[25] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. Faceboxes: A cpu real-time face
detector with high accuracy. 2017.
[26] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. Unitbox: An advanced object detection
network. ACM, 2016.
[28] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren. Distance-iou loss: Faster and better
learning for bounding box regression. arXiv, 2019.
[29] Y. F. Zhang, W. Ren, Z. Zhang, Z. Jia, L. Wang, and T. Tan. Focal and efficient iou loss
for accurate bounding box regression. 2021.
[30] J. Wang, C. Xu, W. Yang, and L. Yu. A normalized gaussian wasserstein distance for
tiny object detection. 2021.
[31] A. Trockman and J Zico Kolter. Patches are all you need? arXiv e-prints, 2022.
[32] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting
pedestrians in a crowd. 2017.
[33] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li. Bridging the gap between anchor-based
and anchor-free detection via adaptive training sample selection. In 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[34] Chenchen Zhu, Yutong Zheng, Khoa Luu, and Marios Savvides. Cms-rcnn: Contextual
multi-scale region-based cnn for unconstrained face detection. 2017.
[35] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou. Selective refinement network for
high performance face detection. 2018.
[36] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network
for object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2018.
[37] M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. Ssh: Single stage headless face
detector. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[38] S. Yang, Y. Xiong, C. L. Chen, and X. Tang. Face detection through scale-friendly deep
convolutional networks. 2017.
[39] Songtao Liu, Di Huang, et al. Receptive field block net for accurate and fast object
detection. In Proceedings of the European conference on computer vision (ECCV), pages
385–400, 2018.
[40] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang. Dsfd:
Dual shot face detector. 2018.
[41] Z. Li, P. Chao, Y. Gang, X. Zhang, and S. Jian. Detnet: A backbone network for object
detection. 2018.
[42] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal
generation and joint object detection. IEEE, 2016.
[43] B. Singh and L. S. Davis. An analysis of scale invariance in object detection - snip. 2017.
[44] Xu Tang, Daniel K Du, Zeqiang He, and Jingtuo Liu. Pyramidbox: A context-assisted
single shot face detector. In Proceedings of the European conference on computer vision
(ECCV), pages 797–813, 2018.
[45] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou. Retinaface: Single-stage
dense face localisation in the wild. 2019.
[46] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with
online hard example mining. In IEEE Conference on Computer Vision & Pattern Recog-
nition, pages 761–769, 2016.
[47] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE Transactions on Pattern Analysis & Machine
Intelligence, 37(9):1904–1916, 2015.
[48] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance seg-
mentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2018.