FPGA-SoC Implementation of YOLOv4 For Flying-Object Detection
https://doi.org/10.1007/s11554-024-01440-w
RESEARCH
Abstract
Flying-object detection has become an increasingly attractive avenue for research, particularly with the rising prevalence of unmanned aerial vehicles (UAVs). Deep learning methods offer an effective means of detection with high accuracy. Meanwhile, the demand to implement deep learning models on embedded devices is growing, fueled by the requirement for capabilities that are both real-time and power efficient. FPGAs have emerged as an optimal choice thanks to their parallelism, flexibility and energy efficiency. In this paper, we propose an FPGA-based design of the YOLOv4 network to address the problem of flying-object detection. Our proposed design explores and provides a suitable solution for overcoming the challenge of limited floating-point resources while maintaining accuracy and obtaining real-time performance and energy efficiency. We generated an appropriate dataset of flying objects, trained and fine-tuned the network parameters on this dataset, and then changed some components of the YOLO networks to fit the deployment on FPGA. Our experiments on the Xilinx ZCU104 development kit show that, with our implementation, the accuracy is competitive with the original model running on CPU and GPU despite the format conversion and model quantization. In terms of speed, the FPGA implementation with the ZCU104 kit is inferior to an ultra high-end GPU, the RTX 2080Ti, but outperforms the GTX 1650. In terms of power consumption, the FPGA implementation is about 3 times lower than the GTX 1650 GPU and about 7 times lower than the RTX 2080Ti. In terms of energy efficiency, the FPGA is clearly superior to the GPUs, being 2–3 times more efficient than the RTX 2080Ti and 3–4 times more efficient than the GTX 1650.
Keywords FPGA · YOLO · Neural network · UAV · Object detection · Vitis HLS
* Minh-Thuy Le
thuy.leminh@hust.edu.vn
Dai-Duong Nguyen
duong.nguyendai@hust.edu.vn
Quoc-Cuong Nguyen
cuong.nguyenquoc@hust.edu.vn
1 School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
2 Control Automation in Production and Improvement of Technology Institute, Hanoi, Vietnam

1 Introduction

In recent years, image-based analysis and processing algorithms have been a hot topic and received a lot of attention. Image processing is applied not only in civilian fields but also widely in the military and healthcare fields. In these fields, aerial and remote sensing images are used to detect aircraft targets in military bases and airfields, which is of great significance for intelligence deployment. Recently, UAVs or drones have been widely used for many different purposes. With increasingly modern technology, they are equipped with many advanced and flexible functions and their price is not too high. UAVs are used in many fields for many different purposes, and some of them can threaten security. Therefore, it is necessary to detect illegal UAVs under many different conditions and with diverse sizes. Flying-object detection methods can be divided into those based on radar [1], radio-frequency waves [2], or image and video [3]. In [4, 5], the authors indicate that image-based detection methods are less sensitive to environmental noise compared with other methods. In [6–8], deep learning is used for fast flying-object detection.

Developing a deep learning model involves two main phases: training and inference. Most applications typically run on general-purpose processors such as CPUs and especially GPUs, because their architecture consists of hundreds of cores. This design feature has made GPUs the current hardware platform of choice for several machine learning and deep learning applications.
For the training process, speed is not a priority; rather, achieving the highest possible accuracy ensures the correct implementation of the model. Normally, training the model takes a lot of resources and time, so it is done on the GPU. While providing a flexible and agile computing platform for development, GPUs and CPUs are not optimized for fast inference and have poor power efficiency, especially in embedded or IoT applications. One solution is to use a high-end server for machine learning inference. However, deploying at the edge has many benefits compared to deploying on a server or in the cloud: data is processed immediately without having to be uploaded to the server, saving bandwidth and providing faster response time and higher reliability [9].

The FPGA is an excellent candidate that can solve these problems with its powerful processing capability, low power consumption, and low latency. FPGAs, with their inherently parallel architecture, are suitable for ML and DL applications [10, 11]; moreover, an FPGA can be reconfigured at any time. Therefore, developers optimize and accelerate their systems by designing specialized hardware accelerators using FPGAs. In [12], the authors use a heterogeneous platform acceleration method (FPGA + GPU) that outperforms GPU acceleration. They show that the direct hardware mapping (DHM) of a CNN on embedded FPGAs outperforms GPU implementations in terms of power efficiency (approximately 25% reduction) and execution time (21% reduction in latency). A direct comparison between FPGA and GPU using DNNs was done in [13], and the results showed that the FPGA used (Stratix 10) can achieve about 2.3–4.3 times more GOP/s/Watt than the GPU (Nvidia Titan X).

In this work, we propose an application-based neural network running on FPGA to detect flying objects. Our contributions can be stated as follows: (1) build a custom dataset with different enhancement techniques to train our own network parameters for 5 custom flying-object classes; (2) evaluate different versions of YOLOv3 and YOLOv4 to get the best network; (3) propose some modifications and adaptations of the YOLOv4 network to make it suitable for implementation on FPGA; (4) deploy our custom network on the FPGA ZCU104 using the Vitis AI tool and compare it with GPU platforms.

2 Related works

In the literature, several researchers have implemented YOLO on FPGA for different applications. In 2018, Wei et al. [14] implemented the YOLO platform based on the Xilinx Zynq board and optimized its architecture combined with the FPGA features. Real-time use cases of pedestrian and vehicle recognition have been demonstrated. The testing results show that the average time to recognize a picture on this platform is 51 ms, which is 46.9 times faster than for PC platforms. Yap et al. [15] proposed a fixed-point (16-bit) implementation of a CNN-based object detection model. Using an OpenCL high-level implementation tool, the authors generated a Tiny YOLOv2 synthesis on a Cyclone V FPGA development board. The experiments show that at a working frequency of 100 MHz the proposal achieves a peak performance of 21 GOPs. In 2019, REQ-YOLO, a resource-aware, systematic weight quantization framework for object recognition, was presented by Ding et al. [16]. The authors suggested a complete hardware implementation of block-circulant matrices on CONV layers and developed an efficient processing element (PE) structure supporting heterogeneous weight quantization to enable real-time and highly efficient implementations on FPGA. As a result, REQ-YOLO can greatly compress the YOLO model while only slightly degrading accuracy. Recently, in 2022, Chen et al. [17] introduced an intelligent real-time object detection system for drones using FPGA. The neural network engine is based on the FPGA and designed to accelerate NN models. According to the experimental findings, the suggested FPGA architecture effectively leverages FPGA computing resources, with DSP and LUT utilization rates of 81.56% and 72.80%, respectively. The system can detect objects at a rate of 8 frames per second and uses less power by employing the YOLOv3-tiny model for quick object detection. Li and Hu [18] improved the YOLOv2 model by using the highly concurrent capabilities of a CPU + FPGA structure. The accelerator makes use of floating-point quantization, adder optimization, and various HLS optimization techniques. The PYNQ-Z2 platform from Xilinx is used to conduct the test. According to the authors, the overall computing power of the accelerator is 27.1 GOP/s, the average detection accuracy is 80.6%, and the overall power consumption is 2.609 W. Only 2% of the attained precision is lost when compared to CPU and GPU; however, there is a significant reduction in power usage. A coal gangue identification method based on YOLOv4-tiny was proposed by Xu et al. [19] and implemented on a low-power FPGA hardware platform. First, a computer platform is used to train the YOLOv4-tiny model. The combination of a BN layer and a convolution layer, along with 16-bit fixed-point quantization, further minimizes the model's computation. Second, IP kernels for convolution and pooling have been developed on the FPGA platform to accelerate convolution and pooling calculations. The proposal was tested on a self-made coal gangue dataset and the results indicated that the precision for coal gangue recognition on the FPGA platform is slightly lower than that of CPU and GPU. Hardware power consumption of the FPGA platform is only 2.86 W and the energy efficiency ratio is 10.42 and 3.47 times better than that of CPU and GPU, respectively. On the other hand, Zhang et al. [20] presented a resource-constrained FPGA implementation of YOLOv2 with optimized data transfer and computing efficiency.
Firstly, on-chip data transfer between different types of layers is allowed based on a scalable cross-layer dataflow strategy; flexible off-chip data transfer is also offered when the intermediate results cannot be held on-chip. A filter-level data-reuse dataflow strategy together with a filter-level parallel multiply-accumulate computing processing-element array is then developed. Finally, multi-level sliding buffers are developed to optimize the convolutional computing loop and reuse the input feature maps and weights. The authors argue that this implementation achieves a low power consumption of 4.8 W for executing YOLOv2, uses only 8.3 Mbits of on-chip memory, and reaches a high throughput of 100.33 GOP/s with a power efficiency of 20.90 GOP/s/W. In 2022, Zhang et al. [21] introduced a design of FPGA-based acceleration for the YOLOv4-tiny object detection model by combining software and hardware. The authors accelerated the detection inference process from the original 6–7 min to 383 ms. First, a static fixed-point quantization method is chosen, the position of the decimal point is fixed, and the batch norm is then merged between the convolutional layer and the activation function to form a connection structure. Second, the inference speed on an FPGA (ZYNQ-7020) is improved by increasing the bandwidth cap, reducing bandwidth requirements and employing a massive pipeline design. In tests on the COCO dataset, the average inference time of the YOLOv4-tiny object detection model is reduced from 7.13 min per frame to 498.89 ms per frame while keeping the average accuracy at around 95%. In the latest work, in 2023, Zheng et al. [22] presented a YOLO-like target detection algorithm and deployed it on an FPGA platform. To enable the algorithm to run efficiently on FPGAs, the authors quantized the model and wrote the corresponding hardware operators based on the model units. The proposed object detection accelerator has been implemented and verified on the Xilinx ZYNQ UltraScale+ MPSoC XCZU7EV platform using Vitis HLS. Experimental results show that the detection accuracy of the algorithm model after quantizing the network to 4 bits is 87%, which is comparable to that of common algorithms. The power consumption is 5.508 W, which is lower than that of the CPU and GPU.

3 Neural network for flying-object detection using YOLO

3.1 Datasets preparation

Deep learning-based target detection has the challenge of building enough data with annotated labels for training and testing. Training with large datasets and diverse images gives good results and prevents overfitting or underfitting. Research applying deep learning models often focuses on detecting a single target class such as aircraft or UAVs, or two classes such as UAVs and birds, which fails to exploit the detection and recognition capabilities for many targets. Therefore, in this paper, we built a dataset of many different flying targets to demonstrate the model's diverse and flexible detection capabilities. For aerial targets, there are relatively few available datasets published for research purposes; most of them are about detecting objects on the ground in aerial photographs from UAVs or satellite images. Hence, it is necessary to build a dataset of aerial targets, adding more images to diversify it. We defined five popular target classes in our model as follows: rotated-wing UAV, fixed-wing UAV, military helicopter, fighter aircraft and birds.

We made use of four public datasets: the DUT Anti-UAV Detection and Tracking dataset [23], the Military Aircraft Detection dataset [24], the Military Aircraft Detection in Aerial Images dataset [25], and the Flying Object dataset [26]. In addition, we collected more images from public photos and videos on the Internet to increase the diversity. For efficient labeling of the additional images, our method is to train a pre-model on the four public datasets: we use the YOLOv4 network and then use this trained model to identify targets in the images that we collected, to obtain initial labels for the new images. We then check, correct and add missing or wrong labels with the labeling tool LabelImg [27]. The description of the four public datasets is as follows:

1. The DUT anti-UAV detection and tracking dataset includes 10,000 images of 35 different types of quadrotors. The photos cover many resolutions, from the smallest 160 × 240 to 3744 × 5616. In addition, a variety of lighting conditions (day, night, sunrise, ...) and different weather types (sunny, cloudy, snowy, ...) are also considered.
2. The Military aircraft detection dataset includes 10,300 images with 41 different types of military aircraft. Most of the images are fighter aircraft and unmanned combat aircraft. The number of images used from this dataset is 4695.
3. The Military aircraft detection in aerial images (MADAI) dataset includes 2558 images divided into 5 target classes: bombers, civil aircraft, reconnaissance aircraft, fighters and military helicopters. From this dataset, we used pictures of fighter aircraft and military helicopters. The number of images used is 1123.
4. The Flying object dataset contains 15,064 images with various aircraft, helicopters and birds flying in the air. The number of images used from this dataset is 7087.

In total, after aggregating the 4 datasets, there are 22,905 images.
We found that the bird and rotated-wing UAV target classes have many more samples than the other target classes. The image quality of these two classes is also very high; the size of the objects varies from very small to large compared to the image size, with a large number of images containing very small objects. For the goal of detecting and identifying small targets, it is necessary to add images for the other object classes, especially images taken at long distances so that the target is small compared to the image size; the classes of drones, fighters and helicopters can be named as examples. We proceeded to collect more photos and cut image frames from videos on the Internet. After this addition, the total number of images in our dataset is 34,000, with image sizes varying from 640 × 512 to 8508 × 5939; Fig. 1 shows some examples of our custom labeled dataset. The number of additional images is 11,095, of which 2000 images taken from separate video sources are kept as the test set to ensure fairness when evaluating the best model in practice. The dataset is divided into a training set, a validation set, and a test set. From the 34,000 images, the 2000 images mentioned above are taken as the test set to evaluate the final results of the model on real data. The remaining 32,000 images are randomly divided into training and validation sets in the commonly used ratio of 8:2, as shown in Table 1; this widely accepted practice is chosen for its simplicity and effectiveness in creating a representative split without introducing biases. From the chart in Fig. 2, we can see that the class distributions in the training and validation sets are roughly equal, ensuring the split is unbiased.

Table 1 Feature classes distribution
Class | Training | Validation | Total
Bird | 9225 | 2165 | 11,390
RW UAV | 7143 | 1882 | 9025
FW UAV | 7234 | 1748 | 8982
Fighter | 6378 | 1505 | 7883
Helicopter | 4014 | 922 | 4936
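The 8:2 split described above can be reproduced with a simple shuffled partition. The snippet below is a minimal sketch under our own naming (the function name and the fixed seed are illustrative, not part of the paper's tooling):

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Randomly split a list of labeled image paths into training and validation sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)       # deterministic shuffle for reproducibility
    cut = int(len(paths) * train_ratio)      # 8:2 split point
    return paths[:cut], paths[cut:]

# Example: 32,000 images -> 25,600 for training and 6,400 for validation
# train_set, val_set = split_dataset(all_image_paths)
```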
We train the two models YOLOv3 and YOLOv4 on the training set; the quality metric used is mAP0.5. The test set is used to evaluate each model after a certain number of training iterations. We continued training until the maximum mAP0.5 was reached on the validation set and the average loss no longer decreased. From this, we obtain two sets of weights for the YOLOv3 and YOLOv4 models with the best mAP0.5 on the validation set.
3.2.3 Classes and filters in the convolutional layer before the YOLO layer

The number of filters in the convolutional layer before each YOLO layer determines the number of output channels for the predicted bounding boxes. The default number of filters in these convolutional layers is 255, which corresponds to the original YOLO model trained on the COCO dataset with 80 feature classes. Our model has 5 feature classes, so these parameters need to be changed following Eq. (1):

filters = (classes + x + y + width + height + confidence) × num    (1)

where the 5 feature classes correspond to classes = 5. Since each YOLO layer predicts 3 bounding boxes for each grid cell in the feature map, num = 3, and the other operands (x, y, width, height, confidence) each take the value 1. Finally, filters takes the value (5 + 1 + 1 + 1 + 1 + 1) × 3 = 30. This change ensures that each YOLO layer outputs the right bounding boxes to detect the custom objects.
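As a sanity check, Eq. (1) can be evaluated for any number of classes; the small helper below simply restates the formula (function and variable names are ours):

```python
def filters_before_yolo_layer(num_classes, boxes_per_cell=3):
    """Filters in the convolutional layer preceding a YOLO layer: each predicted box
    needs 4 coordinates (x, y, w, h), 1 confidence score and one score per class."""
    return (num_classes + 5) * boxes_per_cell

print(filters_before_yolo_layer(5))    # 30  -> our five flying-object classes
print(filters_before_yolo_layer(80))   # 255 -> original COCO configuration
```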
3.2.4 Adding a YOLO detection layer to the YOLO-tiny versions

The simplified versions YOLOv3-tiny and YOLOv4-tiny have poor accuracy compared to the full versions YOLOv3 and YOLOv4, especially in the ability to detect small targets. For the goal of detecting flying objects in our research, these two simplified versions may not be suitable even though they can achieve high FPS. Hence, it is necessary to improve the architecture of these two models to increase their detection ability. For more accurate detection of small targets, the simplest change is to add one YOLO detection layer to these two models.

The addition of one YOLO layer to the YOLOv4-tiny model is shown in Fig. 4; the original YOLO-tiny versions only have 2 YOLO layers. The YOLO layer is responsible for predicting bounding boxes and feature-class probabilities. With one additional YOLO detection layer, the total number of YOLO detection layers becomes 3, equal to that of the full YOLOv3 and YOLOv4 models. This helps the detection model to be more accurate because each YOLO layer in the network performs feature detection at a different spatial scale; the additional layer therefore increases the spatial resolution of the feature map, allowing the extraction of more detailed characteristics and better capture of the details of small targets. Of course, adding a YOLO layer makes the model heavier and decreases the processing speed. We name the YOLOv3-tiny and YOLOv4-tiny models with one more layer YOLOv3-tiny 3L and YOLOv4-tiny 3L.
3.3 Model hyperparameters selection

The learning rate is initially set to the default value suggested by the YOLO network author; if training with the default learning rate does not achieve good results, the value is changed. Besides, training is done for a fixed number of iterations (max_batches), so choosing an appropriate value for max_batches is very important to avoid overfitting or underfitting. The author of the YOLO network recommends setting max_batches = classes × 2000, where classes is the number of feature classes, with max_batches not less than the number of images in the training set and not less than 6000. The purpose is to ensure that training has enough iterations to learn the characteristics of the targets. The number of iterations required for training depends on the complexity of the objects, the size of the dataset, and the optimizer's learning rate. The steps value is set to 90% of max_batches: after 90% of max_batches, the learning rate is multiplied by a default factor of 0.1, which helps the model converge faster.
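The guideline above maps directly to a few lines of code. This is a hedged sketch of how the Darknet cfg values can be derived (names are ours; only the rule as stated in the text is encoded):

```python
def training_schedule(num_classes, num_train_images):
    """max_batches = classes * 2000, but at least the training-set size and at least 6000;
    the learning rate is scaled by 0.1 after 90% of max_batches (the 'steps' value)."""
    max_batches = max(num_classes * 2000, num_train_images, 6000)
    steps = int(0.9 * max_batches)
    return max_batches, steps

# Five classes: with 32,000 training images the training-set size drives max_batches
print(training_schedule(5, 32000))   # (32000, 28800)
```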
The choice of network size is also important because it directly affects the quality of the model. The network size is closely related to the ability to detect small targets: obviously, the larger the image size, the easier small objects are to detect, because they are represented by more pixels.
Usually, the default network size is a 1:1 square ratio such as 416 × 416 or 640 × 640. However, few cameras have such a resolution; most use 4:3 or 16:9. We therefore chose the size 640 × 480 (VGA standard, 4:3 ratio) because it is a very popular and common resolution. In practice, if we chose a square-ratio resolution, the image frame would be distorted when deployed with a real camera: the target would no longer keep its original scale after the image resize, which leads to incorrect predictions and reduces the accuracy of the model.
The NMS threshold and the object-class probability threshold are two parameters that directly affect the prediction results of the model, so choosing the optimal pair of these two parameters is very important. The object-class probability threshold (conf_threshold) determines the minimum confidence score for a detection result to be considered a true positive; detections with lower confidence scores are rejected because they are more likely to be false positives. The non-maximum suppression (NMS) threshold, associated with the NMS post-processing step, removes duplicate bounding boxes of the same object: based on the overlap between two bounding boxes, it determines which bounding box will be removed. A lower threshold will result in more bounding boxes being generated, while a higher threshold will result in fewer bounding boxes with higher confidence. These two parameters need to be adjusted to create a balance between precision and recall. For the detection of small aerial targets, a low conf_threshold value should be used to keep as many detections as possible, even if they have a low confidence score; the YOLO default value for conf_threshold is 0.25. Nms_threshold must also be a small value because small targets are more likely to overlap in the image; the YOLO default value for this threshold is 0.5. In order to choose the best value pair, we tried different pairs and evaluated the mAP, precision, recall, and F1-score of the model for each pair to choose the most optimal one. We only conducted this selection for the best model. Before that, to choose the best model, the default values given by the YOLO author are used, i.e., conf_threshold = 0.25 and nms_threshold = 0.5 (rated at mAP@0.5).
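To make the role of the two thresholds concrete, here is a minimal sketch of confidence filtering followed by greedy, per-class IoU-based NMS. It is illustrative only and is not the Darknet implementation:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def postprocess(detections, conf_threshold=0.25, nms_threshold=0.5):
    """detections: list of (box, score, class_id).
    1) drop low-confidence detections; 2) per class, keep the highest-scoring box and
    suppress remaining boxes of that class whose IoU with it exceeds nms_threshold."""
    kept = []
    candidates = sorted((d for d in detections if d[1] >= conf_threshold),
                        key=lambda d: d[1], reverse=True)
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [d for d in candidates
                      if d[2] != best[2] or iou(d[0], best[0]) <= nms_threshold]
    return kept
```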
3.4 Model training and evaluation

We decided to use the cloud platform VastAI (https://vast.ai/) to rent the hardware to train the models. The hardware configuration used for training is:

• GPU: NVIDIA GeForce RTX 2080 Ti, 11 GB VRAM (GPU RAM), 19 TFLOPS (TeraFLOPS) computation capacity, TDP (Thermal Design Power) 250 W.
• CPU: AMD Ryzen Threadripper PRO 3975WX, 32 cores. RAM: 37 GB.

The GPU used for training is a high-end GPU, so adding a low-end GPU makes the evaluation more detailed and richer. Therefore, we also use a personal computer with the following configuration:

• GPU: NVIDIA GeForce GTX 1650, 4 GB GDDR6, 3.0 TFLOPS computation capacity, TDP 50 W.
• CPU: AMD Ryzen 5 5600H (6C/12T, 3.3/4.2 GHz, 3 MB L2/16 MB L3). RAM: 16 GB DDR4-3200 MHz.

The graphs of the loss function and mAP of the models during training are shown in Fig. 5. During training, the average loss value and mAPval0.5 are calculated after each iteration. All models are then evaluated with their best weights on the test set (mAPtest0.5). Speed is evaluated by performing inference on one video with HD resolution of 1280 × 720, 1 min 31 s duration, 30 fps, for a total of 2747 frames. The testing configuration is batch = 1 and subdivisions = 1. We obtain the final average FPS after completing the video.
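For reference, the metrics used throughout this section follow the standard detection definitions; this summary is added for completeness (notation is ours):

\[
\mathrm{IoU}(B_p, B_{gt}) = \frac{\lvert B_p \cap B_{gt}\rvert}{\lvert B_p \cup B_{gt}\rvert},
\qquad
\mathrm{AP}_c = \int_0^1 p_c(r)\,dr,
\qquad
\mathrm{mAP}_{0.5} = \frac{1}{N}\sum_{c=1}^{N} \mathrm{AP}_c,
\]

where a prediction of class c counts as a true positive when its IoU with a ground-truth box is at least 0.5, p_c(r) is the precision of class c at recall r, and N is the number of classes (here N = 5).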
Table 2 shows the results of evaluating the accuracy of the models on the validation set. It can be seen that the models achieve quite high results on the validation set, which is understandable because the selected weights are those with the highest results on the validation set. The YOLOv4 and YOLOv3 models have quite similar results, not significantly different, and these two versions are significantly superior to the shortened versions. The reduced models YOLOv3-tiny, YOLOv3-tiny 3L, YOLOv4-tiny and YOLOv4-tiny 3L also have results that are not significantly different from one another. Relying on the validation set alone is not enough to draw conclusions about which model is the best, so it is necessary to evaluate the models on the test set. The test set is a set of images unseen by the models, selected from sources other than the training/validation sets. The results of the accuracy evaluation on the test set are shown in Table 3. The YOLOv4 model achieved outstanding results on the accuracy scales compared to the remaining models, showing the improvement of this version over the earlier YOLOv3. From this table, we can also see that the effect of adding an extra YOLO layer is evident with the two models YOLOv3-tiny 3L and YOLOv4-tiny 3L: the accuracy is improved compared to the versions with only 2 YOLO layers. The detailed AP result for each target class on the test set is shown in Table 4.

In Table 5, it can be seen that YOLOv3-tiny is the model that achieves the highest FPS on both the high-end and low-end GPUs. The reason is that YOLOv3-tiny is the simplest model, and the trade-off is poor accuracy. All the remaining reduced models (YOLOv3-tiny 3L, YOLOv4-tiny and YOLOv4-tiny 3L) also have high FPS and small model sizes and can run in real time on both GPU platforms. The two full models YOLOv3 and YOLOv4 can meet real-time requirements on the high-end GPU.
Fig. 5 Graphs of the loss function and mAP of the different models during training
However, with a low-end GPU like the GTX 1650, these two models have quite low FPS. In short, YOLOv4-tiny 3L is a model with pretty good accuracy compared to the rest of the reduced models; it is of course not comparable to the full YOLOv4 model, but it can run in real time even on a low-end GPU such as the GTX 1650. Therefore, we choose the two models YOLOv4 and YOLOv4-tiny 3L to continue deploying on FPGA. YOLOv4 is a large, complex and very heavy network model, so it is necessary to reduce its computational complexity in order to meet the real-time performance and energy-efficiency requirements on FPGA.
3.5 Model compression with the pruning method

To reduce the size and computational requirements of the model, many methods can be considered, such as reducing the size (resolution) of the input and output of the network or reducing the number of layers of the model. However, with these two methods, the accuracy and the ability to detect small targets would be seriously affected. In our work, we chose an alternative called the pruning method. Pruning is known as removing network nodes that do not contribute to detection. Pruning with an appropriate technique can only slightly reduce the accuracy of the model (or even improve accuracy in some cases) while making the model lighter and faster. A typical pruning process consists of three stages: first, training a large, overparameterized model (with too many unnecessary parameters); second, pruning the trained large model according to a certain criterion; third, fine-tuning the trimmed model to restore accuracy. Pruning techniques can be classified into three main groups:

• Criteria-based: L0-norm, L1-norm, L2-norm, L-Inf norm and Random.
• Projection-based: PLS (Single) + VIP, PLS (Multi) + VIP, CCA (Multi) + CV and PLS (Multi) + LC.
• Cluster-based: HAC + PCC.
For the YOLOv4 model, we selected the criteria-based pruning technique with the L1 norm. The L1-based criterion is widely used and has been successfully applied to many deep learning models, showing efficiency and flexibility. This method is also simple and easy to implement and does not require adjustment of the model's hyperparameters. Several state-of-the-art studies compared different pruning techniques for the YOLOv4 model [4, 31–33] and achieved the best results with pruning based on the L1/L2 norm. In fact, the L1 norm is also a regularization technique commonly used in machine learning and deep learning models to prevent overfitting. In the pruning technique, the L1 norm is used to determine the least significant weights in the neural network and set them to 0, which reduces the total number of parameters and improves the efficiency of the model without too much decrease in accuracy. The implementation in our work follows the ideas of the authors in [33], taking the weights of the trained model as input. The filter pruning step is illustrated in Fig. 6: the significance of the filters, represented by blue and green rectangles, is assessed through the computation of their L1 norm; filters with lower values are chosen for pruning, as indicated by the green circle. The pruned model experiences a reduction in accuracy, prompting the subsequent fine-tuning process. On Darknet, the pruning process includes the following steps:

• Iterating through each convolution layer of the trained model and calculating the L1 norm of the weights of each filter in the layer.
• Sorting the filters in each layer in descending order according to their L1 norm.
• Removing the filters with the lowest L1 norm at a preset ratio from each layer (setting the weights of these filters to 0).
• Adjusting the input and output channels of subsequent layers to maintain the structure of the network.
• Saving the trimmed model.
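A minimal NumPy sketch of the per-layer step is given below. It mirrors the listed procedure (L1 norm per filter, zeroing the lowest fraction) but is not the actual Darknet patch used in this work:

```python
import numpy as np

def prune_conv_layer(weights, pruning_ratio=0.6):
    """weights: array of shape (out_channels, in_channels, k, k) for one convolutional layer.
    Zero out the filters whose L1 norm falls in the lowest `pruning_ratio` fraction."""
    l1 = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)   # L1 norm per filter
    order = np.argsort(l1)                                           # ascending: weakest filters first
    n_prune = int(pruning_ratio * weights.shape[0])
    pruned = weights.copy()
    pruned[order[:n_prune]] = 0.0                                    # set pruned filters to zero
    return pruned, order[:n_prune]   # pruned indices are needed to adjust the next layer's input channels
```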
We experimented with different pruning ratio values to find the most optimal one (a pruning ratio of 0.5 means that the 50% of filters with the lowest L1 norm are removed from the network). The model after pruning is fine-tuned for an additional 30,000 iterations with a learning rate of 0.05 to recover the accuracy. The pruning results with different ratio values are shown in Table 6. We note that the pruning ratio of 0.6 achieved the best results: the total number of parameters of the model and the BFLOPS computation cost decreased by 51%, while mAPval0.5 and mAPtest0.5 were restored close to those of the original model. With a higher pruning ratio of 0.7, the model suffered a significant reduction in accuracy, while with a lower ratio the accuracy did not increase much. The YOLOv4 model pruned with a ratio of 0.6 is called YOLOv4-pruned to distinguish it from the original YOLOv4 model and is selected for further deployment on the FPGA.

Another problem, mentioned in Sect. 3.3, is the selection of the conf_threshold and nms_threshold parameter pair. We have chosen the YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L models to deploy on FPGA. We use the YOLOv4 model to find the best value pair, then evaluate the remaining two models with the found pair. Through a survey keeping one value fixed and changing the other (with a 0.05 step), we found in Table 7 that conf_threshold = 0.15 is the most optimal value: if this value is set smaller, precision decreases significantly with a lot of FP detections. Likewise, in Table 8, nms_threshold = 0.25 is the most optimal value: when nms_threshold is smaller than this value, the model produces many overlapping bounding boxes for the same target. In short, with the pair of values conf_threshold = 0.15 and nms_threshold = 0.25, the model achieved the best tradeoff between precision and recall. The mAP of all models is improved (increased by about 2% compared to the baseline); the precision decreases slightly or remains constant and the recall is significantly improved, as shown in Table 9. The experiments also show that this pair of values is suitable for detecting smaller targets. With the reduction of nms_threshold from 0.5 to 0.25, the evaluated mAP results will be mAP0.25 instead of mAP0.5.

Table 2 Models evaluation on the validation set
Model | mAPval0.5 (%) | Precision | Recall | F1-score
YOLOv3 | 92.07 | 0.91 | 0.89 | 0.90
YOLOv3-tiny | 80.01 | 0.83 | 0.70 | 0.76
YOLOv3-tiny 3L | 83.70 | 0.85 | 0.72 | 0.78
YOLOv4 | 93.0 | 0.89 | 0.91 | 0.90
YOLOv4-tiny | 81.76 | 0.83 | 0.70 | 0.76
YOLOv4-tiny 3L | 83.8 | 0.86 | 0.71 | 0.78
The bold value of each column is the optimal value for the criterion when comparing among the different YOLO network versions

Table 3 Models evaluation on the test set
Model | mAPtest0.5 (%) | Precision | Recall | F1-score
YOLOv3 | 80.20 | 0.83 | 0.71 | 0.77
YOLOv3-tiny | 70.43 | 0.74 | 0.65 | 0.69
YOLOv3-tiny 3L | 75.06 | 0.78 | 0.67 | 0.72
YOLOv4 | 88.53 | 0.92 | 0.80 | 0.86
YOLOv4-tiny | 74.11 | 0.78 | 0.70 | 0.74
YOLOv4-tiny 3L | 77.90 | 0.80 | 0.73 | 0.76
The bold value of each column is the optimal value for the criterion when comparing among the different YOLO network versions

Table 4 AP of each target class on the test set (%)
Model | Bird | FW UAV | RW UAV | Fighter | Helicopter
YOLOv3 | 73.90 | 82.97 | 81.46 | 80.98 | 81.67
YOLOv3-tiny | 64.85 | 73.25 | 76.71 | 68.15 | 69.18
YOLOv3-tiny 3L | 75.26 | 75.80 | 77.57 | 74.94 | 71.74
YOLOv4 | 85.50 | 92.60 | 92.48 | 84.22 | 87.86
YOLOv4-tiny | 73.59 | 67.75 | 81.83 | 72.22 | 75.17
YOLOv4-tiny 3L | 76.04 | 76.84 | 82.08 | 77.13 | 77.43

Figure 7 shows some samples of the detection using the GPU; from left to right are the models YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L. It can be noted that the YOLOv4 model detects small targets relatively well under the definition of small targets used here (16 × 16 to 42 × 42 pixels in the 640 × 480 image, and 27 × 27 to 73 × 73 pixels in 1280 × 720 images), with high confidence scores and little confusion among target classes. YOLOv4-pruned gives results comparable to YOLOv4. YOLOv4-tiny 3L can detect close and medium-range targets well but still has difficulty with small targets, which are more easily confused and missed than with the two remaining models. For extremely small targets of only a few pixels, all three models either fail to detect them or detect them but mistake them for another target class.

4 FPGA-SoC implementation of YOLOv4

After selecting the 3 best models (YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L), the next stage is to deploy these models on FPGA with the ZCU104 kit and the Vitis AI platform.

4.1 Converting model formats from Darknet to TensorFlow
Table 6 Pruning model evaluation with different scale values (columns: Pruning ratio | Total parameters (M) | BFLOPS | Weights size (MB) | mAPval0.5 (%) | mAPtest0.5 (%) | FPS on RTX2080 | FPS on GTX1650)
Table 8 Nms_threshold analysis with conf_threshold = 0.25
NMS thres. | Precision | Recall | F1 | TP | FP | FN | Avg IoU (%)
0.5 | 0.92 | 0.80 | 0.86 | 4714 | 412 | 1135 | 88.44
0.45 | 0.92 | 0.81 | 0.86 | 4723 | 403 | 1126 | 88.49
0.4 | 0.92 | 0.81 | 0.86 | 4732 | 394 | 1117 | 88.56
0.35 | 0.92 | 0.81 | 0.86 | 4740 | 386 | 1109 | 88.57
0.3 | 0.93 | 0.81 | 0.86 | 4743 | 383 | 1106 | 88.56
0.25 | 0.93 | 0.81 | 0.87 | 4747 | 379 | 1102 | 88.58
0.2 | 0.93 | 0.81 | 0.87 | 4746 | 380 | 1103 | 88.56
0.15 | 0.93 | 0.81 | 0.87 | 4747 | 379 | 1102 | 88.57
0.1 | 0.93 | 0.81 | 0.87 | 4747 | 379 | 1102 | 88.57
0.05 | 0.93 | 0.81 | 0.87 | 4745 | 381 | 1104 | 88.53
The frozen graph is saved to a pb file. The pb file contains the TensorFlow graph definition, including the model architecture, weights, and any other variables needed for the network. The weights are stored as constants in the graph, which means they cannot be modified during processing. This format conversion step affects the accuracy of the original model only negligibly, as shown in Table 10.
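For illustration, the snippet below shows a generic TensorFlow 1.x way of producing such a frozen pb file from a checkpoint. It is not the exact Darknet-to-TensorFlow conversion script used here, and the checkpoint path and output node names are placeholders:

```python
import tensorflow as tf   # TensorFlow 1.15, the version supported by the Vitis AI quantizer

def freeze_to_pb(checkpoint_prefix, output_node_names, pb_path):
    """Load a checkpoint, fold the variables into constants and save a frozen GraphDef."""
    with tf.compat.v1.Session() as sess:
        saver = tf.compat.v1.train.import_meta_graph(checkpoint_prefix + ".meta")
        saver.restore(sess, checkpoint_prefix)
        frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names)       # weights become graph constants
    with tf.io.gfile.GFile(pb_path, "wb") as f:
        f.write(frozen.SerializeToString())

# Example (placeholder names, not from the paper):
# freeze_to_pb("./ckpt/yolov4", ["pred_boxes", "pred_scores"], "yolov4_frozen.pb")
```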
Table 9 Models evaluation with conf_threshold = 0.15 and nms_threshold = 0.25
Model | Precision | Recall | F1 | mAPtest0.25 (%)
YOLOv4 | 0.88 | 0.84 | 0.86 | 90.37
YOLOv4-pruned | 0.85 | 0.86 | 0.85 | 89.02
YOLOv4-tiny 3L | 0.75 | 0.77 | 0.76 | 78.95

4.2 Model quantization

There are two main quantization methods, PTQ and QAT. The Vitis AI Quantizer supports both of these methods with TensorFlow 1.15. PTQ does not require retraining or labeled data; in most cases, the PTQ method is sufficient to achieve 8-bit quantization with accuracy similar to that of 32-bit float. In contrast, QAT requires fine-tuning and labeled training data but allows lower-bit quantization with possibly better model accuracy.
Table 10 mAP evaluation of the converted model and the original model
Model | mAPval0.25 (%), Darknet | mAPval0.25 (%), TF frozen graph
YOLOv4 | 95.54 | 95.50
YOLOv4-pruned | 95.39 | 95.36
YOLOv4-tiny 3L | 89.06 | 89.00

In this project, we use the PTQ method. This step is performed using the Vitis AI quantizer, which takes a 32-bit float model as input, performs preprocessing (folding batch-norm functions and removing nodes unnecessary for inference), and then converts the weights/biases and activation values to the specified bit width (here 8-bit integer).

To collect activation statistics and improve the accuracy of the quantized models, the Vitis AI quantizer must run several inference iterations to calibrate the activation values. Therefore, a calibration image dataset is required as input. As recommended by Xilinx, quantization works well with 100–1000 images. Back-propagation is not required, so an unlabeled dataset is sufficient. Here, we use 640 images randomly taken from the training/testing set as the calibration set. After calibration, the quantized model is transformed into a DPU-deployable model that conforms to the DPU's data format. This model can then be compiled by the Vitis AI compiler and deployed to the DPU.
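Conceptually, the calibration pass estimates a scale for each tensor from the calibration images and then maps float values to 8-bit integers. The sketch below illustrates this idea only; it does not reproduce the internals of the Vitis AI quantizer:

```python
import numpy as np

def calibrate_scale(activation_batches):
    """Estimate a symmetric per-tensor scale from calibration activations (int8 range)."""
    max_abs = max(float(np.abs(a).max()) for a in activation_batches)
    return max_abs / 127.0

def quantize(x, scale):
    """Map float values to int8 using the calibrated scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Calibration: run a few hundred unlabeled images through the float model,
# collect a layer's activations, then derive its scale once:
# scale = calibrate_scale(collected_activations)
```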
4.3 Program flow

The executable program is developed on top of the Vitis AI runtime, which provides APIs for communication between the DPU and the microprocessor; the supported languages are C++ and Python. In this project, the executable program is written in C++.

The program reads image frames from a camera or a video for processing and display; its purpose is to test experimentally whether the implementation with a camera can meet the real-time requirements and to measure the FPS rate. We found that the time spent on pre- and post-processing is very small compared to the DPU computation. Therefore, instead of using only 1 DPU core, the frame-processing program is designed to take advantage of 2 DPU cores to improve processing speed. In fact, one DPU IP can support up to 4 DPU cores, but with the resources of the ZCU104 kit only 2 DPU cores can be instantiated. With 2 DPU cores running in parallel, we can run inference on 2 images at the same time. The program on the CPU is hence designed in a multi-threaded manner so that it can process several frames at the same time. Each thread is responsible for processing a separate input frame sequence, performing preprocessing and inference using its own DPU core. In this way, the system improves overall throughput and reduces latency. An illustration of the program workflow with 2 threads is shown in Fig. 8.

To achieve the highest performance (the highest FPS), it is necessary to test the program with different numbers of threads to find the optimal value. More threads is not necessarily better, as it increases the power consumption of the system and may not effectively use the 2 DPU cores. The experiment on different numbers of threads to find the optimal value is covered in Sect. 5.2.
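Our executable is written in C++, but the threading pattern can be sketched compactly in Python. `create_dpu_runner` and `run_yolov4` below are hypothetical placeholders standing in for the DPU inference calls, not real API names:

```python
import threading, queue, cv2   # OpenCV for video capture and resizing

def create_dpu_runner(core_id):     # placeholder: stands in for creating a runner on a DPU core
    return core_id

def run_yolov4(runner, image):      # placeholder: stands in for DPU inference + post-processing
    return []                       # would return a list of detections

def worker(frame_queue, result_queue, runner):
    """Each thread preprocesses and infers its own frame sequence on its own DPU runner."""
    while True:
        idx, frame = frame_queue.get()
        if frame is None:                              # poison pill -> stop the thread
            break
        blob = cv2.resize(frame, (640, 480))           # network input size used in this work
        result_queue.put((idx, run_yolov4(runner, blob)))

def run_pipeline(video_path, num_threads=4):
    frames, results = queue.Queue(maxsize=8), queue.Queue()
    threads = [threading.Thread(target=worker,
                                args=(frames, results, create_dpu_runner(i % 2)))  # 2 DPU cores on ZCU104
               for i in range(num_threads)]
    for t in threads:
        t.start()
    cap, idx = cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.put((idx, frame))
        idx += 1
    for _ in threads:
        frames.put((None, None))
    for t in threads:
        t.join()
    return results
```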
5.2 Processing time
Fig. 10 Detection result on FPGA (from left to right: YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L)
Table 12 Average processing time of each part of the algorithm
Model | E2E_MEAN (ms) | DPU_MEAN (ms) | CPU_MEAN (ms)
YOLOv4 | 107.94 | 104.0 | 3.94
YOLOv4-pruned | 66.53 | 62.60 | 3.93
YOLOv4-tiny 3L | 19.18 | 15.27 | 3.91

This software measurement method cannot be as accurate as a hardware method, but it gives us an approximation. The average power consumption of the RTX 2080Ti GPU, the GTX 1650 GPU and the ZCU104 kit without our application running is 22 W, 28 W and 16.1 W, respectively.

Table 14 presents the power consumption of the two GPUs, and Table 15 gives the consumption of the ZCU104 for different numbers of threads. Obviously, running the executable program with more threads leads to higher power consumption. With only 1 thread, only 1 DPU core is used, so the power consumption is the lowest; power consumption increases significantly starting from 2 threads, because 2 DPU cores are then used. Combining this with the results of the FPS evaluation in the previous section, it can be noted that when the executable program runs with more than 4 threads, the FPS is not improved but the system consumes more power. Therefore, we use the FPGA implementation with 4 threads to compare with the implementation on the GPU.

Table 14 Average power consumption of GPU
Model | RTX 2080 Ti (W) | GTX 1650 (W)
YOLOv4 | 255.4 | 93.7
YOLOv4-pruned | 220.8 | 84.1
YOLOv4-tiny 3L | 180.5 | 76.3
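The per-frame latencies in Table 12 can be related to the measured frame rates and to the FPS/W figures with a simple back-of-the-envelope model; the assumption that both DPU cores are kept fully busy is ours:

```python
def estimated_fps(e2e_ms, dpu_cores=2):
    """Throughput when `dpu_cores` frames are processed in parallel (assumes full DPU utilisation)."""
    return dpu_cores * 1000.0 / e2e_ms

def fps_per_watt(fps, power_w):
    """Energy-efficiency figure of merit used in Figs. 11 and 12."""
    return fps / power_w

# YOLOv4-pruned: 66.53 ms end-to-end with 2 DPU cores -> about 30 FPS,
# consistent with the ~29 FPS measured on the ZCU104.
print(round(estimated_fps(66.53), 1))
```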
Figure 11 summarizes the results of the models deployed on the GPU and FPGA hardware platforms, and Fig. 12 compares energy efficiency (FPS/W). In terms of speed, the FPGA implementation with the ZCU104 kit is inferior to the high-end RTX 2080Ti GPU but outperforms the low-end GTX 1650 GPU. In terms of power consumption, the FPGA implementation significantly outperforms the GPUs: it is about 3 times lower than the GTX 1650 and about 7–8 times lower than the RTX 2080Ti. In terms of FPS/W energy efficiency, the FPGA is also superior to the GPUs, being 2–3 times more efficient than the RTX 2080Ti and 3–4 times more efficient than the GTX 1650. The efficiency of the pruning method is clearly shown with the YOLOv4 model implemented on the ZCU104, where the pruned model has a negligible decrease in accuracy (mAPval0.25 of 88.13% compared to 88.21% before pruning, and mAPtest0.25 of 82.41% compared to 83.70% before pruning) but a greatly improved FPS (from 17.9 to 29). The FPS/W power efficiency is also improved compared to the original model (1.04 vs 0.6, about 73% more efficient).

6 Conclusion

In this research, we presented an application-based neural network and its implementation on FPGA. We enhanced our training dataset from various sources with different techniques. Our system can detect 5 well-known classes of flying objects with high accuracy. With our FPGA proposal, the accuracy of the YOLOv4 network is reduced to an acceptable degree due to the format conversion and model quantization. In terms of speed, the FPGA implementation with the ZCU104 kit is inferior to the high-end RTX 2080Ti GPU but outperforms the low-end GTX 1650 GPU. In terms of power consumption, the FPGA implementation is significantly lower than the GPUs. Regarding FPS/W energy efficiency, the FPGA is clearly superior to the GPUs, being 2–3 times more efficient than the RTX 2080Ti and 3–4 times more efficient than the GTX 1650. In conclusion, our application developed on FPGA can handle real-time speed while consuming significantly less power than the GPU implementations. This makes FPGAs well suited for embedded vision applications requiring high power efficiency. With the application of appropriate model compression methods, complex and heavy deep learning models can also achieve real-time performance on FPGA.

Funding This work is funded under project number B2024-BKA-08.

Data availability The data is included in the manuscript.

References

1. Coluccia, A., Parisi, G., Fascista, A.: Detection and classification of multirotor drones in radar sensor networks: a review. Sensors 20(15), 4172 (2020)
2. Martian, A., Chiper, F.-L., Craciunescu, R., Vladeanu, C., Fratu, O., Marghescu, I.: RF based UAV detection and defense systems: survey and a novel solution. In: 2021 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), pp. 1–4. IEEE (2021)
3. Dewangan, V., Saxena, A., Thakur, R., Tripathi, S.: Application of image processing techniques for UAV detection using deep learning and distance-wise analysis. Drones 7(3), 174 (2023)
4. Liu, H., Fan, K., Ouyang, Q., Li, N.: Real-time small drones detection based on pruned YOLOv4. Sensors 21(10), 3374 (2021)
5. Liu, B., Luo, H.: An improved YOLOv5 for multi-rotor UAV detection. Electronics 11(15), 2330 (2022)
6. Mamdouh, N., Khattab, A.: YOLO-based deep learning framework for olive fruit fly detection and counting. IEEE Access 9, 84252–84262 (2021)
7. Jiang, C., Ren, H., Ye, X., Zhu, J., Zeng, H., Nan, Y., Sun, M., Ren, X., Huo, H.: Object detection from UAV thermal infrared images and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 112, 102912 (2022)
8. Diwan, T., Anirudh, G., Tembhurne, J.V.: Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 82(6), 9243–9275 (2023)
9. Crockett, L., Northcote, D., Ramsay, C., Robinson, F., Stewart, R.: Exploring Zynq MPSoC: with PYNQ and machine learning applications (2019)
10. Chen, R., Tianyu, W., Zheng, Y., Ling, M.: MLoF: machine learning accelerators for the low-cost FPGA platforms. Appl. Sci. 12(1), 89 (2022)
11. DiCecco, R., Lacey, G., Vasiljevic, J., Chow, P., Taylor, G., Areibi, S.: Caffeinated FPGAs: FPGA framework for convolutional neural networks. In: 2016 International Conference on Field-Programmable Technology (FPT), pp. 265–268. IEEE (2016)
12. Carballo-Hernández, W., Pelcat, M., Berry, F.: Why is FPGA-GPU heterogeneity the best option for embedded deep neural networks? (2021). arXiv:2102.01343
13. Nurvitadhi, E., Venkatesh, G., Sim, J., Marr, D., Huang, R., Jason, O.G.H., Liew, Y.T., Srivatsan, K., Moss, D., Subhaschandra, S., et al.: Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 5–14 (2017)
14. Wei, G., Hou, Y., Cui, Q., Deng, G., Tao, X., Yao, Y.: YOLO acceleration using FPGA architecture. In: 2018 IEEE/CIC International Conference on Communications in China (ICCC), pp. 734–735. IEEE (2018)
15. Yap, J.W., bin Mohd Yussof, Z., bin Salim, S.I., Lim, K.C.: Fixed point implementation of Tiny-YOLO-v2 using OpenCL on FPGA. Int. J. Adv. Comput. Sci. Appl. 9(10) (2018)
16. Ding, C., Wang, S., Liu, N., Xu, K., Wang, Y., Liang, Y.: REQ-YOLO: a resource-aware, efficient quantization framework for object detection on FPGAs. In: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42 (2019)
17. Chen, C., Min, H., Peng, Y., Yang, Y., Wang, Z.: An intelligent real-time object detection system on drones. Appl. Sci. 12(20), 10227 (2022)
18. Li, W., Hu, H.: FPGA-based object detection acceleration architecture design. J. Phys. Conf. Ser. 2405, 012011 (2022)
19. Shanyong, X., Zhou, Y., Huang, Y., Han, T.: YOLOv4-tiny-based coal gangue image recognition and FPGA implementation. Micromachines 13(11), 1983 (2022)
20. Zhang, Z., Mahmud, M.A.P., Kouzani, A.Z.: Resource-constrained FPGA implementation of YOLOv2. Neural Comput. Appl. 34(19), 16989–17006 (2022)
21. Zhang, F., Li, Y., Ye, Z.: Apply YOLOv4-tiny on an FPGA-based accelerator of convolutional neural network for object detection. J. Phys. Conf. Ser. 2303, 012032 (2022)
22. Zheng, X., He, T.: Reduced-parameter YOLO-like object detector oriented to resource-constrained platform. Sensors 23(7), 3510 (2023)
23. Zhao, J., Zhang, J., Li, D., Wang, D.: Vision-based anti-UAV detection and tracking. IEEE Trans. Intell. Transport. Syst. 23(12), 25323–25334 (2022)
24. Military aircraft detection dataset: https://www.kaggle.com/datasets/a2015003713/militaryaircraftdetectiondataset
25. Wang, Y., Wang, T., Zhou, X., Cai, W., Liu, R., Huang, M., Jing, T., Lin, M., He, H., Wang, W., et al.: TransEffiDet: aircraft detection and classification in aerial images based on EfficientDet and transformer. Comput. Intell. Neurosci. 2022 (2022)
26. Flying-object dataset (2022). https://universe.roboflow.com/new-workspace-0k81p/flying_object_dataset
27. Tzutalin: LabelImg. Git code (2015). https://github.com/tzutalin/labelImg. Accessed Apr 2020
28. Netron: https://github.com/lutzroeder/netron
29. Xilinx Inc.: DPUCZDX8G for Zynq UltraScale+ MPSoCs. Version PG338 (v3.4) (2022)
30. Misra, D.: Mish: a self regularized non-monotonic activation function (2019). arXiv:1908.08681
31. Linglin, H., Li, Q., He, X., Maosong, L.: Research on pruning algorithm of target detection model with YOLOv4. In: 2020 Chinese Automation Congress (CAC), pp. 3283–3287. IEEE (2020)
32. Deng, C., Jing, D., Ding, Z., Han, Y.: Sparse channel pruning and assistant distillation for faster aerial object detection. Remote Sens. 14(21), 5347 (2022)
33. de Vinícius, P.V., Lisboa, A.C., Barbosa, A.V.: An automatic fire detection system based on deep convolutional neural networks for low-power, resource-constrained devices. Neural Comput. Appl. 34(18), 15349–15368 (2022)
34. Kumar, A., Shaikh, A.M., Li, Y., Bilal, H., Yin, B.: Pruning filters with L1-norm and capped L1-norm for CNN compression. Appl. Intell. 51(2), 1152–1160 (2020)
35. Nvtop: NVIDIA GPUs htop-like monitoring tool. https://github.com/Syllo/nvtop

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dai-Duong Nguyen received the Electrical Engineering degree in 2014 (specialization in Industrial Informatics) from Hanoi University of Science and Technology, Vietnam, the M.S. degree in Information, Systems and Technology from Paris-Sud University, France, in 2015, and the PhD degree in Robotics from Paris-Sud University in 2018. Currently, he is a lecturer at the School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology (HUST). His research activities are focused on RF-based localization, visual SLAM and real-time applications on embedded systems.

Dang-Tuan Nguyen received the engineer degree in Electrical Engineering from Hanoi University of Science and Technology (HUST), Vietnam, in March 2023. Currently, he is a researcher at the Control, Automation in Production and Improvement of Technology Institute (CAPITI), Hanoi, Vietnam. His research includes computer vision and real-time applications on embedded systems.