A Review of YOLO Object Detection Algorithms Based
Abstract: Object detection is a research hotspot in computer vision. The YOLO series shows good performance in
object detection and has been widely used in robot vision, autonomous driving and other fields in recent years.
This paper first introduces the YOLO series algorithms, including the principles, innovations, advantages and
disadvantages of each algorithm, then introduces the application fields of the YOLO series, and finally analyzes
its future development trend to provide a reference for research on the topic.
Keywords: Object Detection; YOLO; Deep Learning.
make local predictions, it can avoid falling between several cells with good confidence at the same time, but it
is not a matter of grabbing the things needed. [10] With the further development of neural networks, the YOLO
series has shown good object detection performance and has spawned many branches, as shown in Figure 1.

Figure 1. Part of the YOLO series products

2.2. YOLOv2

Compared with YOLOv1, YOLOv2 adopts the Darknet-19 network, which includes 19 convolutional layers and 5 max
pooling layers. It mainly uses two kinds of convolutional layers, 3x3 and 1x1. The 1x1 convolution compresses the
number of channels of the feature map, which reduces model computation and parameters. The introduction of
anchors and K-means clustering improved the recall rate. A BN (batch normalization) layer is used after each
convolutional layer to prevent overfitting and speed up model convergence. Finally, global average pooling is
used for prediction. The feature fusion module (passthrough) is introduced to fuse fine-grained features. With
YOLOv2, the mAP of the model is not significantly improved, but the amount of computation is reduced. However,
YOLOv2 tends to miss small targets: because the feature map is subsampled by design, its resolution is too low to
detect small targets well. [11] The CNN architecture used in YOLOv2 is relatively simple and does not use an RPN,
which leaves an accuracy gap compared with some advanced object detection algorithms.

2.3. YOLOv3

Compared with YOLOv2, YOLOv3 adopts Darknet-53 as the backbone network. Compared with Darknet-19, it is deeper,
has more parameters and is trained more adequately, so the detection performance is better. YOLOv3 adopts
multi-scale prediction, which improves the detection accuracy of small targets by detecting objects at different
scales. Feature pyramids with different convolution kernel sizes are used to detect objects of different sizes,
which overcomes the shortcomings of YOLOv2 in small target detection. YOLOv3 uses three anchor boxes of different
sizes, which adapt better to objects of different sizes and aspect ratios. It also introduces a residual network
structure, which learns features better and speeds up training. [12] However, YOLOv3's inference is slower
because the detection process needs to run multiple convolutional layers, and YOLOv3 has problems with inaccurate
object positions.

2.4. YOLOv4

Compared with YOLOv3, YOLOv4 combines many previous research techniques with appropriate innovations and achieves
a better balance between speed and accuracy than the previous YOLO series. However, YOLOv4 is not much different
from YOLOv3 in essence. YOLOv4 uses CSPDarknet53 as the backbone network for feature extraction, which reduces
the consumption of computing resources. An SPP structure and residual connections are introduced to accelerate
and optimize the convolution process, improving prediction performance and speed. Post-processing optimizations
such as dynamic adjustment of the IoU threshold and logistic activation instead of softmax were added to improve
the accuracy and robustness of detection results. The OpenCV DNN module is used to optimize forward computation,
and the detection speed is greatly improved while detection accuracy is maintained. However, computing resource
consumption is high, because the YOLOv4 model is relatively large and requires a lot of computation. Although
YOLOv4 has great advantages in performance, it is also more complex than some other networks, requiring more
training time and more network parameters. Its effect on small target detection is also poor: because YOLOv4 uses
larger anchor boxes and a higher prediction resolution, it performs worse on small targets than other algorithms.

2.5. YOLOv5

Compared with YOLOv4, the YOLOv5 model is smaller, so it has higher computing efficiency and lower video memory
usage. Some new evaluation metrics are added to YOLOv5, such as mAP@.5 and mAP@.75, which help to evaluate model
performance more comprehensively. YOLOv5 enhances the learning ability of the CNN, which makes it lightweight
while maintaining accuracy during detection. YOLOv5 can run inference directly on a single image, a batch of
images, video and even webcam input. YOLOv5s reaches an object recognition speed of up to 140 FPS. From YOLOv5n
to YOLOv5x, the detection accuracy of the five YOLOv5 models gradually increases, while the detection speed
gradually decreases. [13] However, the biggest drawback of YOLOv5 is its weak target detection capability in
complex or non-standard scenes: if the scene is unusual or the target is relatively far away, the accuracy of
YOLOv5 decreases. In addition, compared with some other target detection algorithms, YOLOv5 has a shallow network
depth, which affects its performance.

2.6. YOLOv7

At present, the model with the best balance of accuracy and inference performance is YOLOv7 (corresponding to the
open-source git version 0.1). YOLOv7 is the most advanced algorithm of the YOLO series at present, surpassing the
previous YOLO series in accuracy and speed. YOLOv7 introduces the ResNet50 deep residual network, replacing the
CSPDarkNet53 of YOLOv5. To improve anchor box generation, YOLOv7 introduces an anchor-free framework, which does
not require preset anchor boxes and can adapt to targets of various scales and aspect ratios. The multi-scale
training strategy improves the model's ability to detect small targets and to adapt to changes in image
resolution. The Mosaic data augmentation method is introduced, which randomly combines images to increase the
diversity and quantity of training samples. The cuDNN library is used for deep learning computation, which
improves the training and inference speed of the model. YOLOv7 is built on deep learning technology, so it can
improve its detection performance through continued learning and supports transfer learning of deep models. [14]
However, YOLOv7 has more layers and parameters, so it requires more computing resources and more efficient GPUs
for training and inference. Its detection of small objects and crowded scenes is also not good: although YOLOv7
has improved on this, the detection effect is still poor.

3. YOLO Algorithm Application Field based on Deep Learning

The YOLO series is used in many application fields. For YOLOv7, which has the best performance at present, the
following application examples are introduced:
1. Real-time target detection: YOLOv7 can be used for real-time target detection in areas such as autonomous
driving, video surveillance and security systems. These applications need to process high-resolution images and
high-speed video and detect targets correctly, helping monitoring staff and decision makers react quickly.
[15]-[16]
2. Object recognition: YOLOv7 can also perform object recognition. For example, a large warehouse holds many
different types of products and equipment. Each object can be accurately identified and classified using YOLOv7,
allowing better management and control of inventory and production processes.
3. Face recognition: YOLOv7's face detection function can be used for applications such as security access
control, smart gates and party check-in. Using YOLOv7 for face detection and recognition can ensure that only
authorized personnel enter an area, and the activities of participants can be tracked.
4. Natural language processing: YOLOv7 can also be used in conjunction with natural language processing (NLP)
technology to enable automatic text extraction and classification, for example monitoring specific keywords on
social media to see what people are saying about an event. This can help businesses and institutions understand
the views of opinion leaders and consumers and make better marketing strategies and decisions.
5. Robot control: YOLOv7 can be used as a vision engine in robot control. Robots can use YOLOv7 to detect and
track objects such as humans, other robots or obstacles. This helps the robot avoid collisions, locate the target
correctly and react quickly in an emergency.

and other aspects, target detection is a big topic, and real-time online detection is the top priority, which is
worthy of continued research and development. This paper puts forward several prospects for the future
development direction of YOLO:
1. Data augmentation: Adding more data can improve the accuracy of the model. Data augmentation can be achieved
by rotating, scaling, cropping, flipping and color-transforming the training images. Consider using more data
augmentation techniques to enlarge the data set.
2. Better pre-trained model: Consider using stronger pre-trained models to initialize the YOLOv7 network to
improve the accuracy and robustness of the model. For example, larger and deeper models such as ResNet and
EfficientNet can be used.
3. Adaptive learning rates: To better optimize the model, an adaptive learning rate can be used to adjust the
learning rate of each layer and better balance speed and precision during training.
4. Improvement of activation function: Other activation functions such as Swish and Mish could replace the
LeakyReLU used in YOLOv7 to improve the performance of the model.
5. Improvement of target detection loss function: The cross-entropy loss function used by YOLOv7 is one of the
common target detection loss functions, but it may not be optimal in some cases. Other target detection loss
functions, such as Focal Loss and IoU Loss, are worth trying.
6. Multi-task learning: In addition to target detection, other tasks such as semantic segmentation and instance
segmentation can be included in training to improve the comprehensive performance of the model.
7. Model compression: Model compression techniques such as pruning and quantization can be considered to reduce
the model size and inference time of YOLOv7, so that it can be deployed on low-power devices.
These are just some ideas; how to combine and implement them still requires specific experiments and debugging.
How to achieve these improvements while maintaining the performance of the model also needs to be considered.

5. Conclusion

Based on the development history of YOLO, this paper introduces the principles, innovations, advantages and
disadvantages, application fields and future development trend of the YOLO series algorithms, and shows the
effect that they can achieve at present, which can greatly improve the efficiency of production and daily life in
the application fields. With high flexibility and timeliness, YOLO has great application prospects in safety
monitoring. This paper also looks ahead and provides ideas for the future development of the YOLO series.
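
Several points in this review depend on intersection over union (IoU): the mAP@.5 and mAP@.75 metrics count a prediction as correct only when its IoU with a ground-truth box reaches 0.5 or 0.75, and IoU Loss is listed among the candidate loss functions. The following is a minimal Python sketch of that quantity, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The names iou and is_match and the sample boxes are illustrative assumptions and are not taken from any YOLO codebase.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """True if the prediction overlaps the ground truth at the given IoU threshold."""
    return iou(pred, truth) >= threshold


# Hypothetical prediction shifted 2 units to the left of its ground-truth box.
pred = (0.0, 0.0, 10.0, 10.0)
truth = (2.0, 0.0, 12.0, 10.0)

print(round(iou(pred, truth), 3))   # 0.667
print(is_match(pred, truth, 0.5))   # True
print(is_match(pred, truth, 0.75))  # False
```

The same prediction is accepted at the 0.5 threshold and rejected at 0.75, which is why reporting both metrics gives a fuller picture of localization quality than either one alone.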
[3] Dalal N, Triggs B. Histograms of oriented gradients for human detection [C]. 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Diego, USA, 2005:886-893. doi: 10.1109/CVPR.2005.177.
[4] Liu Pu, Zhang Xing-Hui, Zhang Zhi-Li, et al. Overview of Object Detection from RCNN to YOLO [C]// China
High-tech Industrialization Research Association Intelligent Information Processing Industrialization Branch. The
sixteenth national signal and intelligent information processing and application of academic conference
proceedings. [publisher unknown], 2022:8. DOI: 10.26914 / Arthur c. nkihy. 2022.053359.
[5] Liu Yang. Masks worn under the dense crowd scene detection algorithm research [D]. Hebei University of
Engineering, 2022. DOI: 10.27104 /, dc nki. Ghbjy. 2022.000365.
[6] Shi Duan-Yang, Lin Qiang, Hu Bing, Zhang Xin-Yu. Radar Science and Technology, 2022, 20(06):589-605.
[7] Zhou Shujuan, Zhang Yu, Zhang Heng, Wang Qi, Liu Guangjie, Zhu Jinlong. Object recognition of TEM
nanoparticle structures based on deep learning [J]. Journal of Changchun Normal University, 2022, 41(12):35-40.
[8] Zhang Shan, Lu Yujiao, Luo Dawei. Overview of Object Detection Algorithms based on Deep Learning [C]//
Proceedings of the 12th Annual Conference on New Network Technologies and Applications, Network Application
Branch, China Computer Users Association, 2018.
[9] Zhou Xiaoyan, Wang Ke, Li Lingyan. An overview of object detection algorithms based on Deep Learning [J].
Electronic Measurement Technology, 2017, 40(11):5.
[10] Zhang Qi, Zhang Rongmei, Chen Bin. A review of image recognition technology based on deep learning [J]. 2019.
[11] Zheng Wei-Cheng, Li Xue-Wei, Liu Hong-Zhe. Overview of Object Detection Algorithms based on Deep Learning
[C]// The 22nd Annual Conference on New Network Technologies and Applications, 2018, Network Application Branch of
China Computer Users Association. 0.
[12] Liu Yan-Qing. Improved Object Detection Algorithm Based on YOLO Series [D]. Jilin University.
[13] Zheng Weicheng, Li Xuewei, Liu Hongzhe. An overview of object detection algorithms based on Deep Learning
[J]. China Broadband, 2022(3):3.
[14] Li Bingzhen, Jiang Wenzhi, Gu Guide, et al. Review of object detection algorithms based on Convolutional
Neural Networks [J]. Computer and Digital Engineering, 2022(005):050.
[15] Guo Zefang. Overview of Deep Learning Algorithms for Image Object Detection [J]. Mechanical Engineering &
Automation, 2019(1):4.
[16] Wang Yan. Research on visual detection and tracking technology of surgical instruments based on deep
learning.