FPGA-SoC Implementation of YOLOv4 For Flying-Object Detection
https://doi.org/10.1007/s11554-024-01440-w
RESEARCH
Abstract
Flying-object detection has become an increasingly attractive avenue for research, particularly with the rising prevalence of unmanned aerial vehicles (UAVs). Deep learning methods offer an effective means of detection with high accuracy. Meanwhile, the demand to implement deep learning models on embedded devices is growing, fueled by the requirement for capabilities that are both real-time and power efficient. FPGAs have emerged as an optimal choice thanks to their parallelism, flexibility and energy efficiency. In this paper, we propose an FPGA-based design of the YOLOv4 network to address the problem of flying-object detection. Our proposed design explores and provides a suitable solution for overcoming the challenge of limited floating-point resources while maintaining accuracy and obtaining real-time performance and energy efficiency. We generated an appropriate dataset of flying objects, trained and fine-tuned the network parameters on this dataset, and then changed some components of the YOLO networks to fit the deployment on FPGA. Our experiments on the Xilinx ZCU104 development kit show that, with our implementation, the accuracy is competitive with the original model running on CPU and GPU despite the format conversion and model quantization. In terms of speed, the FPGA implementation with the ZCU104 kit is inferior to an ultra high-end GPU, the RTX 2080Ti, but outperforms the GTX 1650. In terms of power consumption, the FPGA implementation is about 3 times lower than the GTX 1650 GPU and about 7 times lower than the RTX 2080Ti. In terms of energy efficiency, the FPGA is clearly superior to the GPUs, being 2–3 times more efficient than the RTX 2080Ti and 3–4 times more efficient than the GTX 1650.
Keywords FPGA · YOLO · Neural network · UAV · Object detection · Vitis HLS
* Minh-Thuy Le
thuy.leminh@hust.edu.vn
Dai-Duong Nguyen
duong.nguyendai@hust.edu.vn
Quoc-Cuong Nguyen
cuong.nguyenquoc@hust.edu.vn
1 School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, Hanoi, Vietnam
2 Control Automation in Production and Improvement of Technology Institute, Hanoi, Vietnam

1 Introduction

In recent years, image-based analysis and processing algorithms have been a hot topic and received a lot of attention. Image processing is applied not only in civilian fields but also widely in the military and healthcare fields. In these fields, aerial and remote sensing images are used to detect aircraft targets in military bases and airfields, which is of great significance for intelligence deployment. Recently, UAVs or drones have been widely used for many different purposes. With increasingly modern technology, they are equipped with many advanced and flexible functions and their price is not too high. UAVs are used in many fields for many different purposes, and some of them can threaten security. Therefore, it is necessary to detect illegal UAVs under many different conditions and with diverse sizes. Flying-object detection methods can be divided into those based on radar [1], radio-frequency waves [2], or image and video [3]. In [4, 5], the authors indicate that image-based detection methods are less sensitive to environmental noise compared with other methods. In [6–8], deep learning is used for fast flying-object detection.

Developing a deep learning model involves two main phases: training and inference. Most applications typically run on general-purpose processors such as CPUs and especially GPUs, because their architecture consists of hundreds of cores. This design feature has made GPUs the current hardware platform of choice for several machine learning and deep learning applications.
For the training process, speed is not a priority; rather, achieving the highest possible accuracy ensures the correct implementation of the model. Normally, training the model takes a lot of resources and time, so it is done on the GPU. While providing a flexible and agile computing platform for development, GPUs and CPUs are not optimized for fast inference and have poor power efficiency, especially in embedded or IoT applications. One solution is to use a high-end server for machine learning inference. However, deploying at the edge has many benefits compared to deploying on a server or in the cloud: data is processed immediately without having to be uploaded to the server, saving bandwidth and providing faster response time and higher reliability [9].

The FPGA is an excellent candidate that can solve these problems with its powerful processing capability, low power consumption, and low latency. FPGAs, with their inherently parallel architecture, are suitable for ML and DL applications [10, 11]; moreover, an FPGA can be reconfigured at any time. Therefore, developers optimize and accelerate their systems by designing specialized hardware accelerators using FPGAs. In [12], the authors use a heterogeneous platform acceleration method (FPGA + GPU) that outperforms GPU acceleration. They show that the direct hardware mapping (DHM) of a CNN on embedded FPGAs outperforms GPU implementations in terms of power efficiency (approximately 25% reduction) and execution time (21% reduction in latency). A direct comparison between FPGA and GPU using DNNs was done in [13], and the results showed that the FPGA used (Stratix 10) can achieve about 2.3–4.3 times more GOP/s/Watt than the GPU (Nvidia Titan X).

In this work, we propose an application-based neural network running on FPGA to detect flying objects. Our contributions can be stated as follows: (1) build a custom dataset with different enhancement techniques to train our own network parameters for 5 custom flying-object classes; (2) evaluate different versions of YOLOv3 and YOLOv4 to get the best network; (3) propose some modifications and adaptations of the YOLOv4 network to make it suitable for implementation on FPGA; (4) deploy our custom network on the FPGA ZCU104 using the Vitis AI tool and compare it with GPU platforms.

2 Related works

In the literature, several researchers have implemented YOLO on FPGA for different applications. In 2018, Wei et al. [14] implemented the YOLO platform based on the Xilinx Zynq board and optimized its architecture combined with the FPGA features. Real-time use cases of pedestrian and vehicle recognition have been demonstrated. The testing results show that the average time to recognize a picture on this platform is 51 ms, which is 46.9 times faster than for PC platforms. Yap et al. [15] proposed a fixed-point (16-bit) implementation of a CNN-based object detection model. Using an OpenCL high-level implementation tool, the authors generated a Tiny YOLOv2 synthesis on a Cyclone V FPGA development board. The experiments show that at a working frequency of 100 MHz the proposal achieves a peak performance of 21 GOPs. In 2019, REQ-YOLO, a resource-aware, systematic weight quantization framework for object recognition, was presented by Ding et al. [16]. The authors suggested a complete hardware implementation of block-circulant matrices on CONV layers and developed an efficient processing element (PE) structure supporting heterogeneous weight quantization to enable real-time and highly efficient implementations on FPGA. As a result, REQ-YOLO can greatly compress the YOLO model while only slightly degrading accuracy. Recently, in 2022, Chen et al. [17] introduced an intelligent real-time object detection system for drones using FPGA. The neural network engine is based on the FPGA and designed to accelerate NN models. According to the experimental findings, the suggested FPGA architecture effectively leverages FPGA computing resources, with DSP and LUT utilization rates of 81.56% and 72.80%, respectively. The system can detect objects at a rate of 8 frames per second and uses less power by employing the YOLOv3-tiny model for quick object detection. Li and Hu [18] improved the YOLOv2 model by using the highly concurrent capabilities of a CPU + FPGA structure. The accelerator makes use of floating-point quantization, adder optimization, and various HLS optimization techniques. The PYNQ-Z2 platform from Xilinx is used to conduct the test. According to the authors, the overall computing power of the accelerator is 27.1 GOP/s, the average detection accuracy is 80.6%, and the overall power consumption is 2.609 W. Only 2% of the attained precision is lost when compared to CPU and GPU; however, there is a significant reduction in power usage. A coal gangue identification method based on YOLOv4-tiny was proposed by Xu et al. [19] and implemented on a low-power FPGA hardware platform. First, a computer platform is used to train the YOLOv4-tiny model. The combination of a BN layer and a convolution layer, along with 16-bit fixed-point quantization, further minimizes the model's computation. Second, IP kernels for convolution and pooling have been developed on the FPGA platform to accelerate convolution and pooling calculations. The proposal was tested on a self-made coal gangue dataset and the results indicated that the precision for coal gangue recognition on the FPGA platform is slightly lower than that of CPU and GPU. Hardware power consumption of the FPGA platform is only 2.86 W and the energy efficiency ratio is 10.42 and 3.47 times better than that of CPU and GPU, respectively. On the other hand, Zhang et al. [20] presented a resource-constrained FPGA implementation of YOLOv2 with optimized data transfer and computing efficiency.
Firstly, on-chip data transfer between different types of layers is allowed based on a scalable cross-layer dataflow strategy; flexible off-chip data transfer is also offered when the intermediate results cannot be held on-chip. A filter-level data-reuse dataflow strategy together with a filter-level parallel multiply-accumulate computing processing-element array is then developed. Finally, multi-level sliding buffers are developed to optimize the convolutional computing loop and reuse the input feature maps and weights. The authors argue that this implementation achieves a low power consumption of 4.8 W for executing YOLOv2, uses only 8.3 Mbits of on-chip memory, and reaches a high throughput of 100.33 GOP/s with a power efficiency of 20.90 GOP/s/W. In 2022, Zhang et al. [21] introduced a design of FPGA-based acceleration for the YOLOv4-tiny object detection model by combining software and hardware. The authors accelerated the detection inference process from the original 6–7 min to 383 ms. First, a static fixed-point quantization method is chosen, the position of the decimal point is fixed, and the batch norm is then merged between the convolutional layer and the activation function to form a connection structure. Second, the inference speed on an FPGA (ZYNQ-7020) is improved by increasing the bandwidth cap, reducing bandwidth requirements and employing a massive pipeline design. In tests on the COCO dataset, the average inference time of the YOLOv4-tiny object detection model is reduced from 7.13 min per frame to 498.89 ms per frame while keeping the average accuracy at around 95%. In the latest work, in 2023, Zheng et al. [22] presented a YOLO-like target detection algorithm and deployed it on an FPGA platform. To enable the algorithm to run efficiently on FPGAs, the authors quantized the model and wrote the corresponding hardware operators based on the model units. The proposed object detection accelerator has been implemented and verified on the Xilinx ZYNQ UltraScale+ MPSoC XCZU7EV platform using Vitis HLS. Experimental results show that the detection accuracy of the algorithm model after quantizing the network to 4 bits is 87%, which is comparable to that of common algorithms. The power consumption is 5.508 W, which is lower than that of the CPU and GPU.

3 Neural network for flying-object detection using YOLO

3.1 Datasets preparation

Deep learning-based target detection has the challenge of building enough data with annotated labels for training and testing. Training with large datasets and diverse images gives good results and prevents overfitting or underfitting. Research applying deep learning models often focuses on detecting a single target class such as aircraft or UAVs, or two classes such as UAVs and birds, which fails to exploit the detection and recognition capabilities for many targets. Therefore, in this paper, we built a dataset of many different flying targets to demonstrate the model's diverse and flexible detection capabilities. For aerial targets, there are relatively few available datasets published for research purposes; most of them are about detecting objects on the ground in aerial photographs from UAVs or satellite images. Hence, it is necessary to build a dataset of aerial targets, adding more images to diversify it. We defined five popular target classes in our model as follows: rotated-wing UAV, fixed-wing UAV, military helicopter, fighter aircraft and birds.

We made use of four public datasets: the DUT Anti-UAV Detection and Tracking dataset [23], the Military Aircraft Detection dataset [24], the Military Aircraft Detection in Aerial Images dataset [25], and the Flying Object dataset [26]. In addition, we collected more images from public photos and videos on the Internet to increase the diversity. For efficient labeling of the additional images, our method is to train a pre-model on the four public datasets: we use the YOLOv4 network and then use this trained model to identify targets in the images that we collected, to obtain initial labels for the new images. We then check, correct and add missing or wrong labels with the labeling tool LabelImg [27]. The description of the four public datasets is as follows:

1. The DUT anti-UAV detection and tracking dataset includes 10,000 images of 35 different types of quadrotors. The photos cover many resolutions, from the smallest 160 × 240 to 3744 × 5616. In addition, a variety of lighting conditions (day, night, sunrise, ...) and different weather types (sunny, cloudy, snowy, ...) are also considered.
2. The Military aircraft detection dataset includes 10,300 images with 41 different types of military aircraft. Most of the images are fighter aircraft and unmanned combat aircraft. The number of images used from this dataset is 4695.
3. The Military aircraft detection in aerial images (MADAI) dataset includes 2558 images divided into 5 target classes: bombers, civil aircraft, reconnaissance aircraft, fighters and military helicopters. From this dataset, we used pictures of fighter aircraft and military helicopters. The number of images used is 1123.
4. The Flying object dataset contains 15,064 images with various aircraft, helicopters and birds flying in the air. The number of images used from this dataset is 7087.

In total, after aggregating the 4 datasets, there are 22,905 images.
We found that the bird and rotated-wing UAV target classes have many more samples than the other target classes. The image quality of these two classes is also very high; the size of the objects varies from very small to large compared to the image size, with a large number of images containing very small objects. For the goal of detecting and identifying small targets, it is necessary to add images for the other object classes, especially images taken at long distances so that the target is small compared to the image size; the classes of drones, fighters and helicopters can be named as examples. We proceeded to collect more photos and cut image frames from videos on the Internet. After this addition, the total number of images in our dataset is 34,000, with image sizes varying from 640 × 512 to 8508 × 5939; Fig. 1 shows some examples of our custom labeled dataset. The number of additional images is 11,095, of which 2000 images taken from separate video sources are kept as the test set to ensure fairness when evaluating the best model in practice. The dataset is divided into a training set, a validation set, and a test set. From the 34,000 images, the 2000 images mentioned above are taken as the test set to evaluate the final results of the model on real data. The remaining 32,000 images are randomly divided into training and validation sets in the commonly used ratio of 8:2, as shown in Table 1; this widely accepted practice is chosen for its simplicity and effectiveness in creating a representative split without introducing biases. From the chart in Fig. 2, we can see that the class distributions in the training and validation sets are roughly equal, ensuring the split is unbiased.

Table 1 Feature classes distribution
Class | Training | Validation | Total
Bird | 9225 | 2165 | 11,390
RW UAV | 7143 | 1882 | 9025
FW UAV | 7234 | 1748 | 8982
Fighter | 6378 | 1505 | 7883
Helicopter | 4014 | 922 | 4936
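The 8:2 split described above can be reproduced with a simple shuffled partition. The snippet below is a minimal sketch under our own naming (the function name and the fixed seed are illustrative, not part of the paper's tooling):

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Randomly split a list of labeled image paths into training and validation sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)       # deterministic shuffle for reproducibility
    cut = int(len(paths) * train_ratio)      # 8:2 split point
    return paths[:cut], paths[cut:]

# Example: 32,000 images -> 25,600 for training and 6,400 for validation
# train_set, val_set = split_dataset(all_image_paths)
```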
We train the two models YOLOv3 and YOLOv4 on the training set; the quality metric used is mAP0.5. The test set is used to evaluate each model after a certain number of training iterations. We continued training until the maximum mAP0.5 was reached on the validation set and the average loss no longer decreased. From this, we obtain two sets of weights for the YOLOv3 and YOLOv4 models with the best mAP0.5 on the validation set.
3.2.3 Classes and filters in the convolutional layer before the YOLO layer

The number of filters in the convolutional layer before each YOLO layer determines the number of output channels for the predicted bounding boxes. The default number of filters in these convolutional layers is 255, which corresponds to the original YOLO model trained on the COCO dataset with 80 feature classes. Our model has 5 feature classes, so these parameters need to be changed following Eq. (1):

filters = (classes + x + y + width + height + confidence) × num    (1)

where the 5 feature classes correspond to classes = 5. Since each YOLO layer predicts 3 bounding boxes for each grid cell in the feature map, num = 3, and the other operands (x, y, width, height, confidence) each take the value 1. Finally, filters takes the value (5 + 1 + 1 + 1 + 1 + 1) × 3 = 30. This change ensures that each YOLO layer outputs the right bounding boxes to detect the custom objects.
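As a sanity check, Eq. (1) can be evaluated for any number of classes; the small helper below simply restates the formula (function and variable names are ours):

```python
def filters_before_yolo_layer(num_classes, boxes_per_cell=3):
    """Filters in the convolutional layer preceding a YOLO layer: each predicted box
    needs 4 coordinates (x, y, w, h), 1 confidence score and one score per class."""
    return (num_classes + 5) * boxes_per_cell

print(filters_before_yolo_layer(5))    # 30  -> our five flying-object classes
print(filters_before_yolo_layer(80))   # 255 -> original COCO configuration
```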
3.2.4 Adding a YOLO detection layer to the YOLO-tiny versions

The simplified versions YOLOv3-tiny and YOLOv4-tiny have poor accuracy compared to the full versions YOLOv3 and YOLOv4, especially in the ability to detect small targets. For the goal of detecting flying objects in our research, these two simplified versions may not be suitable even though they can achieve high FPS. Hence, it is necessary to improve the architecture of these two models to increase their detection ability. For more accurate detection of small targets, the simplest change is to add one YOLO detection layer to these two models.

The addition of one YOLO layer to the YOLOv4-tiny model is shown in Fig. 4; the original YOLO-tiny versions only have 2 YOLO layers. The YOLO layer is responsible for predicting bounding boxes and feature-class probabilities. With one additional YOLO detection layer, the total number of YOLO detection layers becomes 3, equal to that of the full YOLOv3 and YOLOv4 models. This helps the detection model to be more accurate because each YOLO layer in the network performs feature detection at a different spatial scale; the additional layer therefore increases the spatial resolution of the feature map, allowing the extraction of more detailed characteristics and better capture of the details of small targets. Of course, adding a YOLO layer makes the model heavier and decreases the processing speed. We name the YOLOv3-tiny and YOLOv4-tiny models with one more layer YOLOv3-tiny 3L and YOLOv4-tiny 3L.
3.3 Model hyperparameters selection

The learning rate is initially set to the default value suggested by the YOLO network author; if training with the default learning rate does not achieve good results, the value is changed. Besides, training is done for a fixed number of iterations (max_batches), so choosing an appropriate value for max_batches is very important to avoid overfitting or underfitting. The author of the YOLO network recommends setting max_batches = classes × 2000, where classes is the number of feature classes, with max_batches not less than the number of images in the training set and not less than 6000. The purpose is to ensure that training has enough iterations to learn the characteristics of the targets. The number of iterations required for training depends on the complexity of the objects, the size of the dataset, and the optimizer's learning rate. The steps value is set to 90% of max_batches: after 90% of max_batches, the learning rate is multiplied by a default factor of 0.1, which helps the model converge faster.
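The guideline above maps directly to a few lines of code. This is a hedged sketch of how the Darknet cfg values can be derived (names are ours; only the rule as stated in the text is encoded):

```python
def training_schedule(num_classes, num_train_images):
    """max_batches = classes * 2000, but at least the training-set size and at least 6000;
    the learning rate is scaled by 0.1 after 90% of max_batches (the 'steps' value)."""
    max_batches = max(num_classes * 2000, num_train_images, 6000)
    steps = int(0.9 * max_batches)
    return max_batches, steps

# Five classes: with 32,000 training images the training-set size drives max_batches
print(training_schedule(5, 32000))   # (32000, 28800)
```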
The choice of network size is also important because it directly affects the quality of the model. The network size is closely related to the ability to detect small targets: obviously, the larger the image size, the easier small objects are to detect, because they are represented by more pixels.
Usually, the default network size is a 1:1 square ratio such as 416 × 416 or 640 × 640. However, few cameras have such a resolution; most use 4:3 or 16:9. We therefore chose the size 640 × 480 (VGA standard, 4:3 ratio) because it is a very popular and common resolution. In practice, if we chose a square-ratio resolution, the image frame would be distorted when deployed with a real camera: the target would no longer keep its original scale after the image resize, which leads to incorrect predictions and reduces the accuracy of the model.
The NMS threshold and the object-class probability threshold are two parameters that directly affect the prediction results of the model, so choosing the optimal pair of these two parameters is very important. The object-class probability threshold (conf_threshold) determines the minimum confidence score for a detection result to be considered a true positive; detections with lower confidence scores are rejected because they are more likely to be false positives. The non-maximum suppression (NMS) threshold, associated with the NMS post-processing step, removes duplicate bounding boxes of the same object: based on the overlap between two bounding boxes, it determines which bounding box will be removed. A lower threshold will result in more bounding boxes being generated, while a higher threshold will result in fewer bounding boxes with higher confidence. These two parameters need to be adjusted to create a balance between precision and recall. For the detection of small aerial targets, a low conf_threshold value should be used to keep as many detections as possible, even if they have a low confidence score; the YOLO default value for conf_threshold is 0.25. Nms_threshold must also be a small value because small targets are more likely to overlap in the image; the YOLO default value for this threshold is 0.5. In order to choose the best value pair, we tried different pairs and evaluated the mAP, precision, recall, and F1-score of the model for each pair to choose the most optimal one. We only conducted this selection for the best model. Before that, to choose the best model, the default values given by the YOLO author are used, i.e., conf_threshold = 0.25 and nms_threshold = 0.5 (rated at mAP@0.5).
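To make the role of the two thresholds concrete, here is a minimal sketch of confidence filtering followed by greedy, per-class IoU-based NMS. It is illustrative only and is not the Darknet implementation:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def postprocess(detections, conf_threshold=0.25, nms_threshold=0.5):
    """detections: list of (box, score, class_id).
    1) drop low-confidence detections; 2) per class, keep the highest-scoring box and
    suppress remaining boxes of that class whose IoU with it exceeds nms_threshold."""
    kept = []
    candidates = sorted((d for d in detections if d[1] >= conf_threshold),
                        key=lambda d: d[1], reverse=True)
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [d for d in candidates
                      if d[2] != best[2] or iou(d[0], best[0]) <= nms_threshold]
    return kept
```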
3.4 Model training and evaluation

We decided to use the cloud platform VastAI (https://vast.ai/) to rent the hardware to train the models. The hardware configuration used for training is:

• GPU: NVIDIA GeForce RTX 2080 Ti, 11 GB VRAM (GPU RAM), 19 TFLOPS (TeraFLOPS) computation capacity, TDP (Thermal Design Power) 250 W.
• CPU: AMD Ryzen Threadripper PRO 3975WX, 32 cores. RAM: 37 GB.

The GPU used for training is a high-end GPU, so adding a low-end GPU makes the evaluation more detailed and richer. Therefore, we also use a personal computer with the following configuration:

• GPU: NVIDIA GeForce GTX 1650, 4 GB GDDR6, 3.0 TFLOPS computation capacity, TDP 50 W.
• CPU: AMD Ryzen 5 5600H (6C/12T, 3.3/4.2 GHz, 3 MB L2/16 MB L3). RAM: 16 GB DDR4-3200 MHz.

The graphs of the loss function and mAP of the models during training are shown in Fig. 5. During training, the average loss value and mAPval0.5 are calculated after each iteration. All models are then evaluated with their best weights on the test set (mAPtest0.5). Speed is evaluated by performing inference on one video with HD resolution of 1280 × 720, 1 min 31 s duration, 30 fps, for a total of 2747 frames. The testing configuration is batch = 1 and subdivisions = 1. We obtain the final average FPS after completing the video.
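For reference, the metrics used throughout this section follow the standard detection definitions; this summary is added for completeness (notation is ours):

\[
\mathrm{IoU}(B_p, B_{gt}) = \frac{\lvert B_p \cap B_{gt}\rvert}{\lvert B_p \cup B_{gt}\rvert},
\qquad
\mathrm{AP}_c = \int_0^1 p_c(r)\,dr,
\qquad
\mathrm{mAP}_{0.5} = \frac{1}{N}\sum_{c=1}^{N} \mathrm{AP}_c,
\]

where a prediction of class c counts as a true positive when its IoU with a ground-truth box is at least 0.5, p_c(r) is the precision of class c at recall r, and N is the number of classes (here N = 5).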
Table 2 shows the results of evaluating the accuracy of the models on the validation set. It can be seen that the models achieve quite high results on the validation set, which is understandable because the selected weights are those with the highest results on the validation set. The YOLOv4 and YOLOv3 models have quite similar results, not significantly different, and these two versions are significantly superior to the shortened versions. The reduced models YOLOv3-tiny, YOLOv3-tiny 3L, YOLOv4-tiny and YOLOv4-tiny 3L also have results that are not significantly different from one another. Relying on the validation set alone is not enough to draw conclusions about which model is the best, so it is necessary to evaluate the models on the test set. The test set is a set of images unseen by the models, selected from sources other than the training/validation sets. The results of the accuracy evaluation on the test set are shown in Table 3. The YOLOv4 model achieved outstanding results on the accuracy scales compared to the remaining models, showing the improvement of this version over the earlier YOLOv3. From this table, we can also see that the effect of adding an extra YOLO layer is evident with the two models YOLOv3-tiny 3L and YOLOv4-tiny 3L: the accuracy is improved compared to the versions with only 2 YOLO layers. The detailed AP result for each target class on the test set is shown in Table 4.

In Table 5, it can be seen that YOLOv3-tiny is the model that achieves the highest FPS on both the high-end and low-end GPUs. The reason is that YOLOv3-tiny is the simplest model, and the trade-off is poor accuracy. All the remaining reduced models (YOLOv3-tiny 3L, YOLOv4-tiny and YOLOv4-tiny 3L) also have high FPS and small model sizes and can run in real time on both GPU platforms. The two full models YOLOv3 and YOLOv4 can meet real-time requirements on the high-end GPU.
Fig. 5 Graphs of the loss function and mAP of the different models during training
However, with a low-end GPU like the GTX 1650, these two models have quite low FPS. In short, YOLOv4-tiny 3L is a model with pretty good accuracy compared to the rest of the reduced models; it is of course not comparable to the full YOLOv4 model, but it can run in real time even on a low-end GPU such as the GTX 1650. Therefore, we choose the two models YOLOv4 and YOLOv4-tiny 3L to continue deploying on FPGA. YOLOv4 is a large, complex and very heavy network model, so it is necessary to reduce its computational complexity in order to meet the real-time performance and energy-efficiency requirements on FPGA.
3.5 Model compression with the pruning method

To reduce the size and computational requirements of the model, many methods can be considered, such as reducing the size (resolution) of the input and output of the network or reducing the number of layers of the model. However, with these two methods, the accuracy and the ability to detect small targets would be seriously affected. In our work, we chose an alternative called the pruning method. Pruning is known as removing network nodes that do not contribute to detection. Pruning with an appropriate technique can only slightly reduce the accuracy of the model (or even improve accuracy in some cases) while making the model lighter and faster. A typical pruning process consists of three stages: first, training a large, overparameterized model (with too many unnecessary parameters); second, pruning the trained large model according to a certain criterion; third, fine-tuning the trimmed model to restore accuracy. Pruning techniques can be classified into three main groups:

• Criteria-based: L0-norm, L1-norm, L2-norm, L-Inf norm and Random.
• Projection-based: PLS (Single) + VIP, PLS (Multi) + VIP, CCA (Multi) + CV and PLS (Multi) + LC.
• Cluster-based: HAC + PCC.
For the YOLOv4 model, we selected the criteria-based pruning technique with the L1 norm. The L1-based criterion is widely used and has been successfully applied to many deep learning models, showing efficiency and flexibility. This method is also simple and easy to implement and does not require adjustment of the model's hyperparameters. Several state-of-the-art studies compared different pruning techniques for the YOLOv4 model [4, 31–33] and achieved the best results with pruning based on the L1/L2 norm. In fact, the L1 norm is also a regularization technique commonly used in machine learning and deep learning models to prevent overfitting. In the pruning technique, the L1 norm is used to determine the least significant weights in the neural network and set them to 0, which reduces the total number of parameters and improves the efficiency of the model without too much decrease in accuracy. The implementation in our work follows the ideas of the authors in [33], taking the weights of the trained model as input. The filter pruning step is illustrated in Fig. 6: the significance of the filters, represented by blue and green rectangles, is assessed through the computation of their L1 norm; filters with lower values are chosen for pruning, as indicated by the green circle. The pruned model experiences a reduction in accuracy, prompting the subsequent fine-tuning process. On Darknet, the pruning process includes the following steps:

• Iterating through each convolution layer of the trained model and calculating the L1 norm of the weights of each filter in the layer.
• Sorting the filters in each layer in descending order according to their L1 norm.
• Removing the filters with the lowest L1 norm at a preset ratio from each layer (setting the weights of these filters to 0).
• Adjusting the input and output channels of subsequent layers to maintain the structure of the network.
• Saving the trimmed model.
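A minimal NumPy sketch of the per-layer step is given below. It mirrors the listed procedure (L1 norm per filter, zeroing the lowest fraction) but is not the actual Darknet patch used in this work:

```python
import numpy as np

def prune_conv_layer(weights, pruning_ratio=0.6):
    """weights: array of shape (out_channels, in_channels, k, k) for one convolutional layer.
    Zero out the filters whose L1 norm falls in the lowest `pruning_ratio` fraction."""
    l1 = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)   # L1 norm per filter
    order = np.argsort(l1)                                           # ascending: weakest filters first
    n_prune = int(pruning_ratio * weights.shape[0])
    pruned = weights.copy()
    pruned[order[:n_prune]] = 0.0                                    # set pruned filters to zero
    return pruned, order[:n_prune]   # pruned indices are needed to adjust the next layer's input channels
```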
We experimented with different pruning ratio values to find the most optimal one (a pruning ratio of 0.5 means that the 50% of filters with the lowest L1 norm are removed from the network). The model after pruning is fine-tuned for an additional 30,000 iterations with a learning rate of 0.05 to recover the accuracy. The pruning results with different ratio values are shown in Table 6. We note that the pruning ratio of 0.6 achieved the best results: the total number of parameters of the model and the BFLOPS computation cost decreased by 51%, while mAPval0.5 and mAPtest0.5 were restored close to those of the original model. With a higher pruning ratio of 0.7, the model suffered a significant reduction in accuracy, while with a lower ratio the accuracy did not increase much. The YOLOv4 model pruned with a ratio of 0.6 is called YOLOv4-pruned to distinguish it from the original YOLOv4 model and is selected for further deployment on the FPGA.

Another problem, mentioned in Sect. 3.3, is the selection of the conf_threshold and nms_threshold parameter pair. We have chosen the YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L models to deploy on FPGA. We use the YOLOv4 model to find the best value pair, then evaluate the remaining two models with the found pair. Through a survey keeping one value fixed and changing the other (with a 0.05 step), we found in Table 7 that conf_threshold = 0.15 is the most optimal value: if this value is set smaller, precision decreases significantly with a lot of FP detections. Likewise, in Table 8, nms_threshold = 0.25 is the most optimal value: when nms_threshold is smaller than this value, the model produces many overlapping bounding boxes for the same target. In short, with the pair of values conf_threshold = 0.15 and nms_threshold = 0.25, the model achieved the best tradeoff between precision and recall. The mAP of all models is improved (increased by about 2% compared to the baseline); the precision decreases slightly or remains constant and the recall is significantly improved, as shown in Table 9. The experiments also show that this pair of values is suitable for detecting smaller targets. With the reduction of nms_threshold from 0.5 to 0.25, the evaluated mAP results will be mAP0.25 instead of mAP0.5.

Table 2 Models evaluation on the validation set
Model | mAPval0.5 (%) | Precision | Recall | F1-score
YOLOv3 | 92.07 | 0.91 | 0.89 | 0.90
YOLOv3-tiny | 80.01 | 0.83 | 0.70 | 0.76
YOLOv3-tiny 3L | 83.70 | 0.85 | 0.72 | 0.78
YOLOv4 | 93.0 | 0.89 | 0.91 | 0.90
YOLOv4-tiny | 81.76 | 0.83 | 0.70 | 0.76
YOLOv4-tiny 3L | 83.8 | 0.86 | 0.71 | 0.78
The bold value of each column is the optimal value for the criterion when comparing among the different YOLO network versions

Table 3 Models evaluation on the test set
Model | mAPtest0.5 (%) | Precision | Recall | F1-score
YOLOv3 | 80.20 | 0.83 | 0.71 | 0.77
YOLOv3-tiny | 70.43 | 0.74 | 0.65 | 0.69
YOLOv3-tiny 3L | 75.06 | 0.78 | 0.67 | 0.72
YOLOv4 | 88.53 | 0.92 | 0.80 | 0.86
YOLOv4-tiny | 74.11 | 0.78 | 0.70 | 0.74
YOLOv4-tiny 3L | 77.90 | 0.80 | 0.73 | 0.76
The bold value of each column is the optimal value for the criterion when comparing among the different YOLO network versions

Table 4 AP of each target class on the test set (%)
Model | Bird | FW UAV | RW UAV | Fighter | Helicopter
YOLOv3 | 73.90 | 82.97 | 81.46 | 80.98 | 81.67
YOLOv3-tiny | 64.85 | 73.25 | 76.71 | 68.15 | 69.18
YOLOv3-tiny 3L | 75.26 | 75.80 | 77.57 | 74.94 | 71.74
YOLOv4 | 85.50 | 92.60 | 92.48 | 84.22 | 87.86
YOLOv4-tiny | 73.59 | 67.75 | 81.83 | 72.22 | 75.17
YOLOv4-tiny 3L | 76.04 | 76.84 | 82.08 | 77.13 | 77.43

Figure 7 shows some samples of the detection using the GPU; from left to right are the models YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L. It can be noted that the YOLOv4 model detects small targets relatively well under the definition of small targets used here (16 × 16 to 42 × 42 pixels in the 640 × 480 image, and 27 × 27 to 73 × 73 pixels in 1280 × 720 images), with high confidence scores and little confusion among target classes. YOLOv4-pruned gives results comparable to YOLOv4. YOLOv4-tiny 3L can detect close and medium-range targets well but still has difficulty with small targets, which are more easily confused and missed than with the two remaining models. For extremely small targets of only a few pixels, all three models either fail to detect them or detect them but mistake them for another target class.

4 FPGA-SoC implementation of YOLOv4

After selecting the 3 best models (YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L), the next stage is to deploy these models on FPGA with the ZCU104 kit and the Vitis AI platform.

4.1 Converting model formats from Darknet to TensorFlow
Table 6 Pruning model evaluation with different scale values (columns: Pruning ratio | Total parameters (M) | BFLOPS | Weights size (MB) | mAPval0.5 (%) | mAPtest0.5 (%) | FPS on RTX2080 | FPS on GTX1650)
Table 8 Nms_threshold analysis with conf_threshold = 0.25
NMS thres. | Precision | Recall | F1 | TP | FP | FN | Avg IoU (%)
0.5 | 0.92 | 0.80 | 0.86 | 4714 | 412 | 1135 | 88.44
0.45 | 0.92 | 0.81 | 0.86 | 4723 | 403 | 1126 | 88.49
0.4 | 0.92 | 0.81 | 0.86 | 4732 | 394 | 1117 | 88.56
0.35 | 0.92 | 0.81 | 0.86 | 4740 | 386 | 1109 | 88.57
0.3 | 0.93 | 0.81 | 0.86 | 4743 | 383 | 1106 | 88.56
0.25 | 0.93 | 0.81 | 0.87 | 4747 | 379 | 1102 | 88.58
0.2 | 0.93 | 0.81 | 0.87 | 4746 | 380 | 1103 | 88.56
0.15 | 0.93 | 0.81 | 0.87 | 4747 | 379 | 1102 | 88.57
0.1 | 0.93 | 0.81 | 0.87 | 4747 | 379 | 1102 | 88.57
0.05 | 0.93 | 0.81 | 0.87 | 4745 | 381 | 1104 | 88.53
The frozen graph is saved to a pb file. The pb file contains the TensorFlow graph definition, including the model architecture, weights, and any other variables needed for the network. The weights are stored as constants in the graph, which means they cannot be modified during processing. This format conversion step affects the accuracy of the original model only negligibly, as shown in Table 10.
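For illustration, the snippet below shows a generic TensorFlow 1.x way of producing such a frozen pb file from a checkpoint. It is not the exact Darknet-to-TensorFlow conversion script used here, and the checkpoint path and output node names are placeholders:

```python
import tensorflow as tf   # TensorFlow 1.15, the version supported by the Vitis AI quantizer

def freeze_to_pb(checkpoint_prefix, output_node_names, pb_path):
    """Load a checkpoint, fold the variables into constants and save a frozen GraphDef."""
    with tf.compat.v1.Session() as sess:
        saver = tf.compat.v1.train.import_meta_graph(checkpoint_prefix + ".meta")
        saver.restore(sess, checkpoint_prefix)
        frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names)       # weights become graph constants
    with tf.io.gfile.GFile(pb_path, "wb") as f:
        f.write(frozen.SerializeToString())

# Example (placeholder names, not from the paper):
# freeze_to_pb("./ckpt/yolov4", ["pred_boxes", "pred_scores"], "yolov4_frozen.pb")
```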
Table 9 Models evaluation with conf_threshold = 0.15 and nms_threshold = 0.25
Model | Precision | Recall | F1 | mAPtest0.25 (%)
YOLOv4 | 0.88 | 0.84 | 0.86 | 90.37
YOLOv4-pruned | 0.85 | 0.86 | 0.85 | 89.02
YOLOv4-tiny 3L | 0.75 | 0.77 | 0.76 | 78.95

4.2 Model quantization

There are two main quantization methods, PTQ and QAT. The Vitis AI Quantizer supports both of these methods with TensorFlow 1.15. PTQ does not require retraining or labeled data; in most cases, the PTQ method is sufficient to achieve 8-bit quantization with accuracy similar to that of 32-bit float. In contrast, QAT requires fine-tuning and labeled training data but allows lower-bit quantization with possibly better model accuracy.
Table 10 mAP evaluation of the converted model and the original model
Model | mAPval0.25 (%), Darknet | mAPval0.25 (%), TF frozen graph
YOLOv4 | 95.54 | 95.50
YOLOv4-pruned | 95.39 | 95.36
YOLOv4-tiny 3L | 89.06 | 89.00

In this project, we use the PTQ method. This step is performed using the Vitis AI quantizer, which takes a 32-bit float model as input, performs preprocessing (folding batch-norm functions and removing nodes unnecessary for inference), and then converts the weights/biases and activation values to the specified bit width (here 8-bit integer).

To collect activation statistics and improve the accuracy of the quantized models, the Vitis AI quantizer must run several inference iterations to calibrate the activation values. Therefore, a calibration image dataset is required as input. As recommended by Xilinx, quantization works well with 100–1000 images. Back-propagation is not required, so an unlabeled dataset is sufficient. Here, we use 640 images randomly taken from the training/testing set as the calibration set. After calibration, the quantized model is transformed into a DPU-deployable model that conforms to the DPU's data format. This model can then be compiled by the Vitis AI compiler and deployed to the DPU.
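Conceptually, the calibration pass estimates a scale for each tensor from the calibration images and then maps float values to 8-bit integers. The sketch below illustrates this idea only; it does not reproduce the internals of the Vitis AI quantizer:

```python
import numpy as np

def calibrate_scale(activation_batches):
    """Estimate a symmetric per-tensor scale from calibration activations (int8 range)."""
    max_abs = max(float(np.abs(a).max()) for a in activation_batches)
    return max_abs / 127.0

def quantize(x, scale):
    """Map float values to int8 using the calibrated scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Calibration: run a few hundred unlabeled images through the float model,
# collect a layer's activations, then derive its scale once:
# scale = calibrate_scale(collected_activations)
```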
4.3 Program flow

The executable program is developed on top of the Vitis AI runtime, which provides APIs for communication between the DPU and the microprocessor; the supported languages are C++ and Python. In this project, the executable program is written in C++.

The program reads image frames from a camera or a video for processing and display; its purpose is to test experimentally whether the implementation with a camera can meet the real-time requirements and to measure the FPS rate. We found that the time spent on pre- and post-processing is very small compared to the DPU computation. Therefore, instead of using only 1 DPU core, the frame-processing program is designed to take advantage of 2 DPU cores to improve processing speed. In fact, one DPU IP can support up to 4 DPU cores, but with the resources of the ZCU104 kit only 2 DPU cores can be instantiated. With 2 DPU cores running in parallel, we can run inference on 2 images at the same time. The program on the CPU is hence designed in a multi-threaded manner so that it can process several frames at the same time. Each thread is responsible for processing a separate input frame sequence, performing preprocessing and inference using its own DPU core. In this way, the system improves overall throughput and reduces latency. An illustration of the program workflow with 2 threads is shown in Fig. 8.

To achieve the highest performance (the highest FPS), it is necessary to test the program with different numbers of threads to find the optimal value. More threads is not necessarily better, as it increases the power consumption of the system and may not effectively use the 2 DPU cores. The experiment on different numbers of threads to find the optimal value is covered in Sect. 5.2.
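Our executable is written in C++, but the threading pattern can be sketched compactly in Python. `create_dpu_runner` and `run_yolov4` below are hypothetical placeholders standing in for the DPU inference calls, not real API names:

```python
import threading, queue, cv2   # OpenCV for video capture and resizing

def create_dpu_runner(core_id):     # placeholder: stands in for creating a runner on a DPU core
    return core_id

def run_yolov4(runner, image):      # placeholder: stands in for DPU inference + post-processing
    return []                       # would return a list of detections

def worker(frame_queue, result_queue, runner):
    """Each thread preprocesses and infers its own frame sequence on its own DPU runner."""
    while True:
        idx, frame = frame_queue.get()
        if frame is None:                              # poison pill -> stop the thread
            break
        blob = cv2.resize(frame, (640, 480))           # network input size used in this work
        result_queue.put((idx, run_yolov4(runner, blob)))

def run_pipeline(video_path, num_threads=4):
    frames, results = queue.Queue(maxsize=8), queue.Queue()
    threads = [threading.Thread(target=worker,
                                args=(frames, results, create_dpu_runner(i % 2)))  # 2 DPU cores on ZCU104
               for i in range(num_threads)]
    for t in threads:
        t.start()
    cap, idx = cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.put((idx, frame))
        idx += 1
    for _ in threads:
        frames.put((None, None))
    for t in threads:
        t.join()
    return results
```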
5.2 Processing time
Fig. 10 Detection result on FPGA (from left to right: YOLOv4, YOLOv4-pruned and YOLOv4-tiny 3L)
Table 12 Average processing time of each part of the algorithm
Model | E2E_MEAN (ms) | DPU_MEAN (ms) | CPU_MEAN (ms)
YOLOv4 | 107.94 | 104.0 | 3.94
YOLOv4-pruned | 66.53 | 62.60 | 3.93
YOLOv4-tiny 3L | 19.18 | 15.27 | 3.91

This software measurement method cannot be as accurate as a hardware method, but it gives us an approximation. The average power consumption of the RTX 2080Ti GPU, the GTX 1650 GPU and the ZCU104 kit without our application running is 22 W, 28 W and 16.1 W, respectively.

Table 14 presents the power consumption of the two GPUs, and Table 15 gives the consumption of the ZCU104 for different numbers of threads. Obviously, running the executable program with more threads leads to higher power consumption. With only 1 thread, only 1 DPU core is used, so the power consumption is the lowest; power consumption increases significantly starting from 2 threads, because 2 DPU cores are then used. Combining this with the results of the FPS evaluation in the previous section, it can be noted that when the executable program runs with more than 4 threads, the FPS is not improved but the system consumes more power. Therefore, we use the FPGA implementation with 4 threads to compare with the implementation on the GPU.

Table 14 Average power consumption of GPU
Model | RTX 2080 Ti (W) | GTX 1650 (W)
YOLOv4 | 255.4 | 93.7
YOLOv4-pruned | 220.8 | 84.1
YOLOv4-tiny 3L | 180.5 | 76.3
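The per-frame latencies in Table 12 can be related to the measured frame rates and to the FPS/W figures with a simple back-of-the-envelope model; the assumption that both DPU cores are kept fully busy is ours:

```python
def estimated_fps(e2e_ms, dpu_cores=2):
    """Throughput when `dpu_cores` frames are processed in parallel (assumes full DPU utilisation)."""
    return dpu_cores * 1000.0 / e2e_ms

def fps_per_watt(fps, power_w):
    """Energy-efficiency figure of merit used in Figs. 11 and 12."""
    return fps / power_w

# YOLOv4-pruned: 66.53 ms end-to-end with 2 DPU cores -> about 30 FPS,
# consistent with the ~29 FPS measured on the ZCU104.
print(round(estimated_fps(66.53), 1))
```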
Figure 11 summarizes the results of the models deployed on the GPU and FPGA hardware platforms, and Fig. 12 compares energy efficiency (FPS/W). In terms of speed, the FPGA implementation with the ZCU104 kit is inferior to the high-end RTX 2080Ti GPU but outperforms the low-end GTX 1650 GPU. In terms of power consumption, the FPGA implementation significantly outperforms the GPUs: it is about 3 times lower than the GTX 1650 and about 7–8 times lower than the RTX 2080Ti. In terms of FPS/W energy efficiency, the FPGA is also superior to the GPUs, being 2–3 times more efficient than the RTX 2080Ti and 3–4 times more efficient than the GTX 1650. The efficiency of the pruning method is clearly shown with the YOLOv4 model implemented on the ZCU104, where the pruned model has a negligible decrease in accuracy (mAPval0.25 of 88.13% compared to 88.21% before pruning, and mAPtest0.25 of 82.41% compared to 83.70% before pruning) but a greatly improved FPS (from 17.9 to 29). The FPS/W power efficiency is also improved compared to the original model (1.04 vs 0.6, about 73% more efficient).

6 Conclusion

In this research, we presented an application-based neural network and its implementation on FPGA. We enhanced our training dataset from various sources with different techniques. Our system can detect 5 well-known classes of flying objects with high accuracy. With our FPGA proposal, the accuracy of the YOLOv4 network is reduced to an acceptable degree due to the format conversion and model quantization. In terms of speed, the FPGA implementation with the ZCU104 kit is inferior to the high-end RTX 2080Ti GPU but outperforms the low-end GTX 1650 GPU. In terms of power consumption, the FPGA implementation is significantly lower than the GPUs. Regarding FPS/W energy efficiency, the FPGA is clearly superior to the GPUs, being 2–3 times more efficient than the RTX 2080Ti and 3–4 times more efficient than the GTX 1650. In conclusion, our application developed on FPGA can handle real-time speed while consuming significantly less power than the GPU implementations. This makes FPGAs well suited for embedded vision applications requiring high power efficiency. With the application of appropriate model compression methods, complex and heavy deep learning models can also achieve real-time performance on FPGA.

Funding This work is funded under project number B2024-BKA-08.

Data availability The data is included in the manuscript.

References

1. Coluccia, A., Parisi, G., Fascista, A.: Detection and classification of multirotor drones in radar sensor networks: a review. Sensors 20(15), 4172 (2020)
2. Martian, A., Chiper, F.-L., Craciunescu, R., Vladeanu, C., Fratu, O., Marghescu, I.: RF based UAV detection and defense systems: survey and a novel solution. In: 2021 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), pp. 1–4. IEEE (2021)
3. Dewangan, V., Saxena, A., Thakur, R., Tripathi, S.: Application of image processing techniques for UAV detection using deep learning and distance-wise analysis. Drones 7(3), 174 (2023)
4. Liu, H., Fan, K., Ouyang, Q., Li, N.: Real-time small drones detection based on pruned YOLOv4. Sensors 21(10), 3374 (2021)
5. Liu, B., Luo, H.: An improved YOLOv5 for multi-rotor UAV detection. Electronics 11(15), 2330 (2022)
6. Mamdouh, N., Khattab, A.: YOLO-based deep learning framework for olive fruit fly detection and counting. IEEE Access 9, 84252–84262 (2021)
7. Jiang, C., Ren, H., Ye, X., Zhu, J., Zeng, H., Nan, Y., Sun, M., Ren, X., Huo, H.: Object detection from UAV thermal infrared images and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 112, 102912 (2022)
8. Diwan, T., Anirudh, G., Tembhurne, J.V.: Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 82(6), 9243–9275 (2023)
9. Crockett, L., Northcote, D., Ramsay, C., Robinson, F., Stewart, R.: Exploring Zynq MPSoC: with PYNQ and machine learning applications (2019)
10. Chen, R., Tianyu, W., Zheng, Y., Ling, M.: MLoF: machine learning accelerators for the low-cost FPGA platforms. Appl. Sci. 12(1), 89 (2022)
11. DiCecco, R., Lacey, G., Vasiljevic, J., Chow, P., Taylor, G., Areibi, S.: Caffeinated FPGAs: FPGA framework for convolutional neural networks. In: 2016 International Conference on Field-Programmable Technology (FPT), pp. 265–268. IEEE (2016)
12. Carballo-Hernández, W., Pelcat, M., Berry, F.: Why is FPGA-GPU heterogeneity the best option for embedded deep neural networks? (2021). arXiv:2102.01343
13. Nurvitadhi, E., Venkatesh, G., Sim, J., Marr, D., Huang, R., Jason, O.G.H., Liew, Y.T., Srivatsan, K., Moss, D., Subhaschandra, S., et al.: Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 5–14 (2017)
14. Wei, G., Hou, Y., Cui, Q., Deng, G., Tao, X., Yao, Y.: YOLO acceleration using FPGA architecture. In: 2018 IEEE/CIC International Conference on Communications in China (ICCC), pp. 734–735. IEEE (2018)
15. Yap, J.W., bin Mohd Yussof, Z., bin Salim, S.I., Lim, K.C.: Fixed point implementation of Tiny-YOLO-v2 using OpenCL on FPGA. Int. J. Adv. Comput. Sci. Appl. 9(10) (2018)
16. Ding, C., Wang, S., Liu, N., Xu, K., Wang, Y., Liang, Y.: REQ-YOLO: a resource-aware, efficient quantization framework for object detection on FPGAs. In: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 33–42 (2019)
17. Chen, C., Min, H., Peng, Y., Yang, Y., Wang, Z.: An intelligent real-time object detection system on drones. Appl. Sci. 12(20), 10227 (2022)
18. Li, W., Hu, H.: FPGA-based object detection acceleration architecture design. J. Phys. Conf. Ser. 2405, 012011 (2022)
19. Shanyong, X., Zhou, Y., Huang, Y., Han, T.: YOLOv4-tiny-based coal gangue image recognition and FPGA implementation. Micromachines 13(11), 1983 (2022)
20. Zhang, Z., Mahmud, M.A.P., Kouzani, A.Z.: Resource-constrained FPGA implementation of YOLOv2. Neural Comput. Appl. 34(19), 16989–17006 (2022)
21. Zhang, F., Li, Y., Ye, Z.: Apply YOLOv4-tiny on an FPGA-based accelerator of convolutional neural network for object detection. J. Phys. Conf. Ser. 2303, 012032 (2022)
22. Zheng, X., He, T.: Reduced-parameter YOLO-like object detector oriented to resource-constrained platform. Sensors 23(7), 3510 (2023)
23. Zhao, J., Zhang, J., Li, D., Wang, D.: Vision-based anti-UAV detection and tracking. IEEE Trans. Intell. Transport. Syst. 23(12), 25323–25334 (2022)
24. Military aircraft detection dataset: https://www.kaggle.com/datasets/a2015003713/militaryaircraftdetectiondataset
25. Wang, Y., Wang, T., Zhou, X., Cai, W., Liu, R., Huang, M., Jing, T., Lin, M., He, H., Wang, W., et al.: TransEffiDet: aircraft detection and classification in aerial images based on EfficientDet and transformer. Comput. Intell. Neurosci. 2022 (2022)
26. Flying-object dataset (2022). https://universe.roboflow.com/new-workspace-0k81p/flying_object_dataset
27. Tzutalin: LabelImg. Git code (2015). https://github.com/tzutalin/labelImg. Accessed Apr 2020
28. Netron: https://github.com/lutzroeder/netron
29. Xilinx Inc.: DPUCZDX8G for Zynq UltraScale+ MPSoCs. Version PG338 (v3.4) (2022)
30. Misra, D.: Mish: a self regularized non-monotonic activation function (2019). arXiv:1908.08681
31. Linglin, H., Li, Q., He, X., Maosong, L.: Research on pruning algorithm of target detection model with YOLOv4. In: 2020 Chinese Automation Congress (CAC), pp. 3283–3287. IEEE (2020)
32. Deng, C., Jing, D., Ding, Z., Han, Y.: Sparse channel pruning and assistant distillation for faster aerial object detection. Remote Sens. 14(21), 5347 (2022)
33. de Vinícius, P.V., Lisboa, A.C., Barbosa, A.V.: An automatic fire detection system based on deep convolutional neural networks for low-power, resource-constrained devices. Neural Comput. Appl. 34(18), 15349–15368 (2022)
34. Kumar, A., Shaikh, A.M., Li, Y., Bilal, H., Yin, B.: Pruning filters with L1-norm and capped L1-norm for CNN compression. Appl. Intell. 51(2), 1152–1160 (2020)
35. Nvtop: NVIDIA GPUs htop-like monitoring tool. https://github.com/Syllo/nvtop

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dai-Duong Nguyen received the Electrical Engineering degree in 2014 (specialization in Industrial Informatics) from Hanoi University of Science and Technology, Vietnam, the M.S. degree in Information, Systems and Technology from Paris-Sud University, France, in 2015, and the PhD degree in Robotics from Paris-Sud University in 2018. Currently, he is a lecturer at the School of Electrical and Electronic Engineering (SEEE), Hanoi University of Science and Technology (HUST). His research activities are focused on RF-based localization, visual SLAM and real-time applications on embedded systems.

Dang-Tuan Nguyen received the engineer degree in Electrical Engineering from Hanoi University of Science and Technology (HUST), Vietnam, in March 2023. Currently, he is a researcher at the Control, Automation in Production and Improvement of Technology Institute (CAPITI), Hanoi, Vietnam. His research includes computer vision and real-time applications on embedded systems.