Article

A Dual-Stage Processing Architecture for Unmanned Aerial Vehicle Object Detection and Tracking Using Lightweight Onboard and Ground Server Computations

by Odysseas Ntousis 1,*, Evangelos Makris 1, Panayiotis Tsanakas 1 and Christos Pavlatos 2,3
1 School of Electrical and Computer Engineering, National Technical University of Athens, 9 Iroon Polytechniou St., 15780 Athens, Greece
2 Digital Development Technologies (DDTech) P.C., 59c Evdomi St., P. Fokaia, 19013 Athens, Greece
3 Hellenic Air Force Academy, Dekelia Air Base, Acharnes, 13671 Athens, Greece
* Author to whom correspondence should be addressed.
Technologies 2025, 13(1), 35; https://doi.org/10.3390/technologies13010035
Submission received: 21 November 2024 / Revised: 26 December 2024 / Accepted: 30 December 2024 / Published: 16 January 2025
(This article belongs to the Section Information and Communication Technologies)

Abstract

UAVs are widely used for multiple tasks, which in many cases require autonomous processing and decision making. This autonomous function often requires significant computational capabilities that cannot be integrated into the UAV due to weight or cost limitations, making the distribution of the workload and the combination of the results produced necessary. In this paper, a dual-stage processing architecture for object detection and tracking in Unmanned Aerial Vehicles (UAVs) is presented, focusing on efficient resource utilization and real-time performance. The proposed system delegates lightweight detection tasks to onboard hardware while offloading computationally intensive processes to a ground server. The UAV is equipped with a Raspberry Pi for onboard data processing, utilizing an Intel Neural Compute Stick 2 (NCS2) for accelerated object detection. Specifically, YOLOv5n is selected as the onboard model. The UAV transmits selected frames to the ground server, which handles advanced tracking, trajectory prediction, and target repositioning using state-of-the-art deep learning models. Communication between the UAV and the server is maintained through a high-speed Wi-Fi link, with a fallback to a 4G connection when needed. The ground server, equipped with an NVIDIA A40 GPU, employs YOLOv8x for object detection and DeepSORT for multi-object tracking. The proposed architecture ensures real-time tracking with minimal latency, making it suitable for mission-critical UAV applications such as surveillance and search and rescue. The results demonstrate the system’s robustness in various environments, highlighting its potential for effective object tracking under limited onboard computational resources. The system achieves recall and accuracy scores as high as 0.53 and 0.74, respectively, using the remote server, and is capable of re-identifying a significant portion of objects of interest lost by the onboard system, measured at approximately 70%.

1. Introduction

Autonomous UAV navigation has been studied as a solution to a wide range of problems and has proven useful in multiple applications, from search-and-rescue missions to facility surveillance and other critical tasks. While remarkable progress has been achieved in this field, significant weaknesses remain, mostly stemming from the lack of computational resources onboard autonomous vehicles. This paper presents an architecture that enables high-quality visual processing of video captured from a UAV by combining the relatively inaccurate results obtained by the limited hardware on the vehicle with those produced by a high-performance remote ground server. In this way, it addresses the challenge of performing complex computations, namely detailed object detection, tracking, approximate trajectory prediction, and navigation command formulation, in real time. The system is specifically designed to detect and track objects (vehicles) and to navigate the UAV in the corresponding direction.
Along with the limited computational power of the UAV, this distributed architecture introduces or intensifies some challenges that must also be taken into consideration. First, reliable communication between the UAV and the server, capable of transferring the necessary information (frames), needs to be established. In addition, the processing on the UAV and the remote system, the communication between them, and decision making for the navigation commands all need to be performed in real time. The combination of the different results is also critical in order to achieve the desired performance.
The proposed system is discussed in the subsequent sections, which are structured as follows. In Section 2, a general theoretical background is provided regarding the various models used, along with some examples of related published works. Section 3 describes the proposed architecture and methods used in the tested system, while Section 4 discusses the results obtained from the various components of the system. Finally, Section 5 presents the conclusions and potential future work.

2. Theoretical Background

2.1. Object Detection

The models examined for the task of object detection in this paper are convolution-based or Transformer-based. More specifically, the best results were observed for the YOLO models (YOLOv5n and YOLOv8x were used) and RT-DETR for the first and second categories, respectively.

2.1.1. YOLO Models

The first “You Only Look Once” model that was introduced [1] is a convolution-based network that attempts to predict bounding-box coordinates and class probabilities at the same time. It consists of 24 convolutional layers and 2 fully connected ones that output the final results.
The input image is initially divided into a grid, with B boxes predicted for each of its cells (B = 2 in the original implementation) and the corresponding confidence scores. Confidence is defined as the product of the predicted object probability and the Intersection over Union between the predicted and the ground-truth boxes. For each cell, C class probabilities are also predicted. The model is trained with a loss function that only takes the classification error for a cell into account if that cell corresponds to an object, and the bounding-box error is considered only for the predictor of the cell that best fits the ground-truth box (has the highest IoU).
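As a concrete illustration of this confidence definition, the short Python sketch below computes the IoU of two axis-aligned boxes and the resulting confidence score; the (x1, y1, x2, y2) box format and the numeric values are illustrative and not taken from the original implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# YOLO confidence for a predicted box: Pr(object) * IoU(prediction, ground truth)
p_object = 0.9                                    # predicted objectness probability (illustrative)
pred, gt = (10, 10, 60, 60), (15, 12, 65, 58)     # illustrative boxes
confidence = p_object * iou(pred, gt)
```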
Due to the previously mentioned simple architecture and problem approach, YOLO has a significant speed advantage in comparison to other detectors, achieving real-time operation capability—even though the first version achieved lower Average Precision than detectors like Faster R-CNN [2]—and it takes context information from the entire image into account for each prediction. It also learns object features that are less instance-specific, allowing better performance in new types of input.
This first YOLO detector formed the basis for significant improvements in subsequent versions (currently, YOLOv10 is the latest), leading to this method achieving state-of-the-art results in terms of speed and Average Precision. Some of the most important improvements of each version can be seen in Table 1, and the details can be found in the corresponding references.

2.1.2. RT-DETR

The Transformer [17] is the first sequence-to-sequence model architecture that is solely based on attention to understand dependencies and produce results, without the support of auxiliary systems. The model has an encoder–decoder structure that can be seen in Figure 1.
The attention modules accept a query vector and a set of key-value vector pairs (derived from embeddings that encode the “meaning” of the segments of the input) and produce an output, which is a weighted sum of the values, with the weights being defined according to some compatibility function between the query and the key corresponding to each value. In order to include information about the position of each segment in the input, unique positional encoding vectors are added to the embeddings.
If Q is a matrix of queries and K and V are the corresponding key and value matrices, the attention results are calculated as
$$A(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where d_k is the query/key vector dimension, and division by √d_k is used to avoid extremely large values of the above dot product.
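The attention computation above can be sketched in a few lines of NumPy; batching, masking, and the learned projections are omitted, and the shapes are chosen purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query/key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of the values

Q = np.random.randn(4, 64)   # 4 queries of dimension d_k = 64
K = np.random.randn(6, 64)   # 6 keys
V = np.random.randn(6, 32)   # 6 values of dimension d_v = 32
out = scaled_dot_product_attention(Q, K, V)              # shape (4, 32)
```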
For better performance, the above operations are not only performed once, but multiple times (multi-head attention), with different query, key, and value weights. The output is calculated as a projection of the concatenation of the individual results of the attention heads.
With the necessary modifications, Transformers have also been used for image processing. One such model designed for object detection is DETR [18], which consists of a CNN backbone network that extracts features from the input image, a Transformer encoder, a Transformer decoder, and a feed-forward network that outputs the final predictions. In order to be processed, the feature map is flattened and passed to the encoder as a sequence.
Based on the architecture of DETR, RT-DETR was proposed in [19] as a real-time model for object detection, achieving state-of-the-art results. An overview of its structure can be seen in Figure 2. Its improved performance, in comparison to the previous Transformer-based detectors (like DETR or Deformable DETR [20]), is the result of redesigning the encoder module, which was the computational bottleneck of the older architectures.
The redesigned encoder (efficient hybrid encoder) consists of two new modules, AIFI (Attention-based Intra-scale Feature Interaction) and CCFF ( CNN-based Cross-scale Feature Fusion). The encoder accepts three scale feature maps, and the AIFI module applies self-attention only on the last (lowest-resolution) layer, since the interaction between lower-level features does not significantly contribute to understanding the semantic relations between parts of the image. The function of CCFF is based on the fusion block, which combines the features of two scales into one new feature. From the output of the encoder, the model selects a fixed number of appropriate features as object queries for the decoder in order to avoid the query optimization difficulty present in older models.

2.2. Object Tracking

2.2.1. KCF Tracker

The Kernelized Correlation Filter (KCF) [21] tracker attempts to reduce the time and memory requirements of the tracking process while avoiding compromising the quality of the results. This is achieved using the properties of circulant matrices. Positive and negative image patch samples are modeled as such matrices by consecutive shifts of the positive sample. The resulting (circulant) matrix for a simple 2D vector patch has the following format:
$$X = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ x_n & x_1 & \cdots & x_{n-1} \\ x_{n-1} & x_n & \cdots & x_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{pmatrix}$$
Using this data representation, better performance can be achieved by mapping the inputs to a higher-dimensional feature space and modeling their similarity as the dot product of their projections via the kernel trick [22]. These dot products are stored in an n × n matrix called the kernel matrix. For many useful kernel functions, the data format above leads to kernel matrices that are also circulant. Circulant matrices become diagonal when the discrete Fourier transform is applied, allowing faster, element-wise computations in the Fourier domain, in which the authors derive their final solution for the parameters of the tracker, as described in the aforementioned paper.
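The following NumPy sketch illustrates the property exploited by KCF: a matrix built from consecutive cyclic shifts of a base sample is diagonalized by the DFT, which is what allows the tracker's computations to reduce to element-wise operations in the Fourier domain. The sample vector is illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])                 # base image-patch sample (illustrative)
n = len(x)

# Circulant data matrix: each row is a cyclic shift of x, as in the matrix above.
X = np.stack([np.roll(x, k) for k in range(n)])

# Circulant matrices are diagonalized by the DFT: X = F diag(fft(x)) F^H,
# where F is the unitary DFT matrix.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
X_reconstructed = F @ np.diag(np.fft.fft(x)) @ F.conj().T
assert np.allclose(X, X_reconstructed)
```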

2.2.2. DeepSort Tracker

DeepSORT [23] is a multi-object tracking-by-detection algorithm based on SORT [24], using deep image features to improve its performance. The track states are represented using eight variables, u, v, γ, h, u̇, v̇, γ̇, and ḣ, where (u, v) is the center of the bounding box in x, y coordinates, γ is the aspect ratio of the box, and h is its height, with the other four variables being their rates of change.
With an observation of u, v, γ, and h, an initial estimation of the new object states is made using a Kalman filter. New detections are then matched to track estimations with the Hungarian algorithm [25], minimizing two distance metrics: the squared Mahalanobis distance between the Kalman filter estimations and the new detections, and an appearance distance, defined as 1 minus the cosine similarity between the feature vectors of the detections and those of the Kalman predictions. The feature vectors are extracted from the image patches using a pretrained convolutional network.
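A simplified sketch of this association step is given below, combining a Mahalanobis-style motion cost with the appearance (cosine) cost and solving the assignment with the Hungarian algorithm via SciPy; the fixed weighting and the absence of DeepSORT's gating and matching cascade are simplifications.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_distance(a, b):
    """1 minus the cosine similarity between two feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def associate(track_means, track_covs, track_feats, det_states, det_feats, w=0.5):
    """Match detections to Kalman track estimates (NumPy arrays; simplified, no gating)."""
    n_t, n_d = len(track_means), len(det_states)
    cost = np.zeros((n_t, n_d))
    for i in range(n_t):
        cov_inv = np.linalg.inv(track_covs[i])
        for j in range(n_d):
            d = det_states[j] - track_means[i]
            motion = d @ cov_inv @ d                      # squared Mahalanobis distance
            appearance = cosine_distance(track_feats[i], det_feats[j])
            cost[i, j] = w * motion + (1 - w) * appearance
    rows, cols = linear_sum_assignment(cost)              # Hungarian algorithm
    return list(zip(rows, cols))
```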

2.3. Time-Series Prediction

LSTM Networks

LSTM (Long Short-Term Memory) networks [26] are Recurrent Neural Networks (RNNs) that have been designed to consider both long- and short-term dependencies between the elements of their input sequences. RNNs are neural networks that process the elements of a sequence using their output for element eₜ₋₁ as an input when calculating the output for element eₜ, thereby “remembering” information they encountered earlier during the processing of that sequence.
In the case of LSTMs, the structure of the units can be considered as composed of three distinct parts (gates): the forget gate, the input gate, and the output gate. An overview of the internal structure of an LSTM cell can be seen in Figure 3.
The part of this architecture that allows the network to have access to older information is the cell state (Cₜ in Figure 3), which accepts small-scale changes for every new input. The first update of the cell state is performed by the forget gate, which regulates the percentage of each element of the previous cell state that will be preserved. This is achieved by an element-wise product of the old cell state with the output of the forget gate. The input gate then updates the cell state with new information by selectively adding part of the input and the previous hidden state. Finally, the output produced by the output gate, which is also the next hidden state, combines the new cell state with the previous hidden state.
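The gate computations described above can be written out explicitly for a single time step; the stacked-gate weight layout below is one common convention, not a specific implementation from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4h, d), U: (4h, h), b: (4h,) hold the four gates stacked."""
    z = W @ x_t + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input and output gates
    g = np.tanh(g)                                 # candidate values for the cell state
    c_t = f * c_prev + i * g                       # keep part of the old state, add new information
    h_t = o * np.tanh(c_t)                         # next hidden state (also the cell output)
    return h_t, c_t
```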

2.4. Related Works

Various methods have been proposed for the autonomous navigation of vehicles—in this case, UAVs—according to the results of some processing of information from their environment. A common strategy for enabling such systems to perform adequately is treating the navigation decision-making task as a Markov decision process and using reinforcement learning to train the system on the effects of its interactions with the environment, as discussed in [27,28,29,30]. In other cases, e.g., [31,32,33] (an application-specific case for powerline inspection), “traditional” models were proposed as a solution to the problem. More complex approaches, based mostly on the processing of visual information with neural networks (mainly CNNs), have also been discussed [34,35,36].
For the task of real-time processing of visual information from UAVs for the detection and tracking of objects, multiple solutions have been proposed, with the most recent ones focusing on the use of deep convolution-based networks like YOLO, running on onboard hardware. For example, an evaluation of various fully convolutional networks for agriculture applications (detection of specific plants) run on an NVIDIA Jetson Nano was discussed in [37], along with appropriate data collection and model training. In [38], a software architecture that performs semantic segmentation and object detection with deep learning models, using RGB and thermal images implemented on an NVIDIA AGX Xavier platform, was presented. Ref. [39] proposed a system for powerline fault detections using a custom-trained YOLOv4 model, tested on a Raspberry Pi and three different NVIDIA Jetson versions, while [40] focused on the task of counting vehicles using a fine-tuned CNN-based model running on an NVIDIA Jetson TX2 board.
Some onboard architectures have also been proposed specifically for the task of vehicle detection and tracking in real time. An example can be found in [41], where a simple and fast system was proposed in which moving vehicles are detected using frame differencing and thresholding, and tracking is performed using a Kalman filter. In [42], the object detection task is performed using a saliency map computed from the captured frames. During the tracking process, a Kalman filter is utilized to estimate the state of the object, and the final result is calculated by a local detector that combines the saliency map information and the temporal difference between frames. In [43], the authors presented a multi-task neural network for vehicle detection, a fast object tracker (MOSSE [44]), and a speed estimation algorithm, running on an Nvidia Xavier NX board.
Attempts have also been made to assign parts of the computational workload to remote devices in the cloud. In [45], the proposed system performs lightweight onboard objectness estimation using the Binarized Normed Gradients (BING) algorithm [46], and in the case of high detected objectness, a high-resolution image of the area of interest is sent to a cloud server, which further processes the image using an R-CNN-based model. In [47], the proposed system performs lightweight detection and counting of people using a small YOLO model on a Raspberry Pi, with an Intel Neural Compute Stick Movidius VPU and, in case some abnormality is detected, a short video is sent to a cloud server for further, more demanding processing.
There are also examples of system architectures more closely related to the one discussed in the present paper. In [48], the proposed system consists of a UAV equipped with an aerial computing platform using an Intel i5 CPU and an Nvidia GTX1070 GPU that detects and tracks an object with specific characteristics that can be determined using image captioning and keyword matching, and a ground GIS (geographic information system) server that allows a more precise estimation of the position of the target object in global coordinates and information about its environment, which can be useful in predicting its trajectory. Another two-part system was presented in [49], with two versions being examined. In the first one, the necessary computations (detection and tracking) are performed on the UAV using an NVIDIA Jetson board. In the second version, this process is implemented on less powerful boards (like the Raspberry Pi) using an Intel Neural Compute Stick. In both cases, the images are transmitted to a ground station that independently performs the procedure on a GPU.

3. Materials and Methods

The proposed solution relies on a two-part processing structure. The first, lighter part of the necessary computation is performed directly on the UAV (shown in Figure 4), while the second, heavier and more precise part is assigned to a powerful remote server, which constantly communicates with the UAV and provides assisting directions.

3.1. UAV

3.1.1. Hardware

For the navigation of the UAV, the Cube Orange autopilot, manufactured by CubePilot (Breakwater, Australia), was used. It is equipped with a 32-bit ARM STM32H753 Cortex-M7 processor, manufactured by STMicroelectronics (Plan-les-Ouates, Geneva, Switzerland), and controls the flight of the UAV (a quadcopter) based on the inputs of various sensors connected to it and on navigation commands sent by the user via a remote control system. Communication with the RC system is performed via a serial port using the Mavlink 2 protocol. The autopilot can also accept Mavlink commands via a secondary port, which is useful for automatically directing the UAV when an object of interest is located. The autopilot software is configured through a large set of parameters and can operate in various flight modes, which determine its behavior in different situations and its response to commands.
Figure 4. A picture of the UAV on which the system was tested.
The video input of the detection/tracking system comes from an SIYI A8 Mini camera, manufactured by SIYI Technology (Shenzhen, China). It provides a high-resolution video stream (up to 4K) and has high light sensitivity, allowing it to capture detailed frames, even in very low lighting conditions. The camera is positioned on a gimbal capable of rotating around 3 axes, operating in 3 modes (follow, lock, and fpv), allowing it to slowly follow the UAV’s rotation, stay in its original direction, or move simultaneously with it. It stabilizes the camera’s view, resisting the tilts of the UAV’s body, which is extremely important in the present case since sudden changes in the angle of the video input would make the tracking process impossible, especially when the UAV adjusts its speed. The video output can be provided in 3 ways: Ethernet, HDMI, or CVBS. By default, the output is provided via Ethernet as an RTSP stream.
The Herelink system, manufactured by Cube Pilot, was used for the manual remote control of the UAV. The ground station of the system exchanges Mavlink commands with the air unit on the UAV, with the transmission/reception of video data also being supported. In addition to navigation commands, Herelink allows the transmission of commands for changing the flight mode of the autopilot.
The processing of the video data on the UAV is performed on a Raspberry Pi 4b (8GB of RAM), manufactured by Raspberry Pi Ltd. (Cambridge, England, UK). A key factor for the successful implementation of the proposed system is the sensitive and time-efficient object detection process. This cannot be performed directly on the Raspberry Pi due to an obvious lack of computational resources. Therefore, the Neural Compute Stick 2, manufactured by Intel (Santa Clara, CA, USA), is selected for the specific task.
The function of the NCS2 relies on a hardware accelerator that can perform inference with deep neural networks, overcoming the limitations of the CPU of the board it is connected to. Inference is performed using the Intel OpenVINO inference engine API, which accepts the model in OpenVINO IR format (an XML and a BIN file) encoding the structure and weights of the model. Since the NCS2 has been discontinued and the latest OpenVINO versions do not support it, an older version of the toolkit (2021.4.2) is used.
The frames from the HDMI output of the camera are received by the Raspberry Pi via USB using an appropriate adaptor (an HDMI-to-USB capture card). Receiving the video through the default RTSP stream via Ethernet is not feasible, since proper decoding requires the reception of all key frames, and the Raspberry Pi cannot decode the stream fast enough, causing significant buffering delay even with a dedicated reading thread.
A portion of the captured frames, depending on the connection quality, needs to be sent to the ground server. Under proper operating conditions, this is done using the Wi-Fi module of the Raspberry Pi, assuming that the UAV is flying in a confined area with high-speed wireless network coverage. If this is not the case, the Raspberry Pi is equipped with a 4G HAT, manufactured by Waveshare (Shenzhen, China), with which it communicates via UART. This will likely not provide the same transfer speed, resulting in a lower frame rate at the server. The baud rate for the communication between the Raspberry Pi and the 4G HAT is set to the maximum allowed by the board (3 Mbps), which is also close to the maximum supported by the HAT (around 3.5 Mbps).
After a cycle of data processing (as described below), the Raspberry Pi directs the UAV toward the specified target. This is achieved by establishing a connection between a USB port on the computer and the secondary telemetry port of the autopilot using a USB-to-serial converter, through which the Raspberry Pi sends the necessary Mavlink commands. The hardware components and the connections between them can be seen in Figure 5.

3.1.2. Software

The purpose of the software running on the UAV is to detect vehicles in its field of view, select one as the target to be followed, and track it while minimizing delays. In addition, it needs to be able to update its target’s location according to the guidance of the server, assuming a reliable connection to it can be established. The software is also responsible for directing the UAV with Mavlink commands using the previously mentioned serial interface, attempting to maintain the target in the center of the frames.
When the appropriate indication is received from the server, or directly from the RC system if the UAV is operating without a server connection, the software detects objects in the most recent frame read from the camera and selects the object located closest to the indicated position as the target. It then tracks this object in the following frames and, at regular time intervals (e.g., 5 s), performs the detection process again. If an object is found close to the target track, the target is updated; this is important mainly because the object can slowly drift out of the tracked bounding box. The KCF tracker, as implemented in OpenCV, is used for this purpose due to the balance it achieves between speed and accuracy.
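A condensed sketch of this onboard loop is shown below, using OpenCV's KCF tracker; the read_frame and detect helpers are placeholders for the camera interface and the YOLOv5n pipeline, and on some OpenCV builds the tracker factory lives under cv2.legacy.

```python
import time
import cv2

REDETECT_INTERVAL = 5.0          # seconds between detection refreshes (as in the text)

def closest_box(boxes, point):
    """Pick the (x, y, w, h) box whose center is closest to the indicated position."""
    def dist(b):
        cx, cy = b[0] + b[2] / 2, b[1] + b[3] / 2
        return (cx - point[0]) ** 2 + (cy - point[1]) ** 2
    return min(boxes, key=dist)

def track_target(read_frame, detect, indicated_point):
    # read_frame() and detect(frame) are placeholders for the camera and the YOLOv5n detector.
    frame = read_frame()
    target = closest_box(detect(frame), indicated_point)
    tracker = cv2.TrackerKCF_create()            # cv2.legacy.TrackerKCF_create() on some builds
    tracker.init(frame, tuple(int(v) for v in target))
    last_detection = time.time()
    while True:
        frame = read_frame()
        ok, box = tracker.update(frame)
        if time.time() - last_detection > REDETECT_INTERVAL:
            # Periodic re-detection: if an object is found close to the track, refresh the target.
            boxes = detect(frame)
            if boxes:
                target = closest_box(boxes, (box[0] + box[2] / 2, box[1] + box[3] / 2))
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, tuple(int(v) for v in target))
            last_detection = time.time()
        yield ok, box
```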
As previously mentioned, the local detection model on the Raspberry Pi needs to be able to run on the Intel NCS2. Although the NCS2 enables the processing of relatively complex neural networks, its capabilities are limited, so a light and effective detection model is necessary for this task. The first option for this purpose is a YOLO model. Since the NCS2 has been discontinued, the most recent YOLO versions are not supported, and the latest model that could be successfully used is YOLOv5. Due to the existing computational limitations, the smallest variant (YOLOv5n) was used. The model was trained on the VisDrone2018-DET dataset [50]. As discussed below, this training resulted in significantly enhanced performance for the task at hand.
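Running the converted YOLOv5n IR model on the NCS2 with the OpenVINO 2021.4 Python API can be sketched as follows; the file names are illustrative, and the decoding of the raw YOLO output into boxes is omitted.

```python
import cv2
import numpy as np
from openvino.inference_engine import IECore   # OpenVINO 2021.x API (NCS2-compatible)

ie = IECore()
net = ie.read_network(model="yolov5n.xml", weights="yolov5n.bin")   # IR files (illustrative names)
exec_net = ie.load_network(network=net, device_name="MYRIAD")       # MYRIAD = the NCS2

input_name = next(iter(net.input_info))
_, _, h, w = net.input_info[input_name].input_data.shape

def detect(frame):
    """Run one inference on the NCS2; decoding of the YOLO output tensor is omitted here."""
    blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis].astype(np.float32) / 255.0
    outputs = exec_net.infer({input_name: blob})
    return outputs          # dict of raw output tensors, to be decoded into boxes
```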
The navigation commands are formed based on the estimated speed of the selected target. The coordinate system of the camera is presumed to differ from that of the vehicle. Therefore, the estimation is calculated as follows:
$$\Delta V_x^{est} = \Delta V_x^{seen} \cdot \cos(\theta) + \Delta V_y^{seen} \cdot \sin(\theta)$$
$$\Delta V_y^{est} = \Delta V_y^{seen} \cdot \cos(\theta) - \Delta V_x^{seen} \cdot \sin(\theta)$$
where ΔV_seen = mpp · ΔV_seen_pixels_per_second, θ is the current estimation of the camera's rotation, and mpp is the estimated meters-per-pixel ratio. These parameters were initially set to θ = 0 and mpp = (average expected length in meters) / (average side length in pixels). They were then improved by observing the effects of speed changes of the UAV and adjusting them accordingly.
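The rotation above, followed by a velocity setpoint sent over the serial Mavlink link, could look roughly like the sketch below; the paper does not name a Mavlink library, so the pymavlink calls, the port name, and the type mask are assumptions.

```python
import math
from pymavlink import mavutil

# Assumed serial connection to the autopilot's secondary telemetry port (see Figure 5).
master = mavutil.mavlink_connection("/dev/ttyUSB0", baud=57600)
master.wait_heartbeat()

def camera_to_body(dvx_seen, dvy_seen, theta, mpp):
    """Rotate the pixel-domain velocity estimate into the vehicle frame (equations above)."""
    vx, vy = dvx_seen * mpp, dvy_seen * mpp
    est_x = vx * math.cos(theta) + vy * math.sin(theta)
    est_y = vy * math.cos(theta) - vx * math.sin(theta)
    return est_x, est_y

def send_velocity(vx, vy):
    """Send a body-frame velocity setpoint; only the velocity fields are used (type mask)."""
    master.mav.set_position_target_local_ned_send(
        0, master.target_system, master.target_component,
        mavutil.mavlink.MAV_FRAME_BODY_OFFSET_NED,
        0b0000111111000111,          # ignore position, acceleration and yaw fields
        0, 0, 0, vx, vy, 0, 0, 0, 0, 0, 0)
```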

3.2. Remote Server

The remote server is equipped with an NVIDIA (Santa Clara, CA, USA) A40 GPU. Its goal is to detect vehicles using accurate models that cannot be run directly on the UAV, track indicated targets, and re-estimate their locations when necessary. The processing of the frames results in a sequence of future position predictions that are sent to the Raspberry Pi on the UAV, which uses them to locate the target if it is lost or incorrect.
For the detection process, both convolutional and Transformer-based models were tested, once again trained on the VisDrone2018-DET dataset. The best real-time results were obtained with YOLOv8x and RT-DETR, and YOLOv8x was eventually selected due to its inference time advantage. Tracking is performed using the DeepSORT tracker due to its ability to take deep image features of the target into consideration. To achieve consistently low processing times, it can be applied in a restricted area around the target when a significantly high number of objects is detected.
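A short sketch of the server-side detection step using the Ultralytics API is given below; the weight file name is illustrative, and the hand-off to DeepSORT is indicated only as a comment, since the paper does not tie the tracker to a specific implementation.

```python
from ultralytics import YOLO

# YOLOv8x weights fine-tuned on VisDrone2018-DET (illustrative file name).
model = YOLO("yolov8x_visdrone.pt")

def detect_vehicles(frame, conf=0.25):
    """Run YOLOv8x on one frame received from the UAV; returns (box, score, class) tuples."""
    result = model(frame, conf=conf, verbose=False)[0]
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        detections.append(((x1, y1, x2, y2), float(box.conf[0]), int(box.cls[0])))
    # These detections are then passed to the DeepSORT tracker, optionally restricted
    # to a region around the current target when many objects are present.
    return detections
```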
The prediction of future positions is performed using an LSTM network with 64 hidden units, followed by a fully connected layer, and has two main purposes. First, the communication between the UAV and the server introduces some delay between sending a frame and receiving a response; therefore, the server sends a sequence of timestamped predictions, from which the Raspberry Pi selects the one closest to its current time. Second, the predictions make a significant contribution to the relocation of a lost target by providing an estimation of its approximate position. The input to this network is a sequence of the previous 10 recorded positions of the target, recorded during the processing of frames in which the target was seen. The LSTM was trained on the VisDroneVDT-2018 dataset [51].
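One possible PyTorch definition of this predictor is sketched below: an LSTM with 64 hidden units followed by a fully connected layer, fed with the last 10 recorded positions and emitting a six-step prediction (the horizon used when evaluating the errors in Section 4.3); the exact input features are an assumption.

```python
import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """LSTM (64 hidden units) plus a fully connected head predicting future positions."""
    def __init__(self, input_size=2, hidden_size=64, horizon=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, horizon * 2)   # (x, y) for each future step
        self.horizon = horizon

    def forward(self, positions):
        # positions: (batch, 10, input_size) -- the last 10 recorded target positions
        _, (h_n, _) = self.lstm(positions)
        out = self.fc(h_n[-1])                          # use the final hidden state
        return out.view(-1, self.horizon, 2)            # (batch, horizon, 2)
```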
The re-estimation of the target's position is based on three metrics that indicate the similarity of a detected object to the lost target: its location (compared to the above-mentioned predictions), the similarity of its features, extracted from appropriate layers of the detection models, to those of the target recorded during the tracking process, and its dimensions. The “closeness” score is calculated using a Gaussian distribution around the predicted point, whose parameters are estimated from the predictions made by the model for the test set of VisDroneVDT-2018 using a maximum likelihood estimator. The feature similarity is calculated as a dot product between the obtained feature vectors. The dimension similarity is probably partly encoded in the feature similarity and is unlikely to provide useful information in a scene with multiple vehicles of the same type; therefore, it is not considered as important. The overall software architecture of the discussed system is presented in Figure 6.
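A simplified sketch of the resulting matching score is shown below, combining the Gaussian “closeness” term, the feature similarity, and a lightly weighted dimension term; the isotropic Gaussian, the feature normalization, and the weights are simplifications of the described method.

```python
import numpy as np

def reid_score(det_center, det_size, det_feat,
               predicted_center, target_size, target_feat,
               sigma=25.0, w=(0.5, 0.4, 0.1)):
    """Score a detected object as a candidate for a lost target (simplified)."""
    # Closeness: Gaussian around the trajectory prediction (sigma estimated offline).
    d2 = np.sum((np.asarray(det_center) - np.asarray(predicted_center)) ** 2)
    closeness = np.exp(-d2 / (2.0 * sigma ** 2))
    # Appearance: similarity of the (normalized) deep feature vectors.
    appearance = float(np.dot(det_feat, target_feat) /
                       (np.linalg.norm(det_feat) * np.linalg.norm(target_feat)))
    # Dimensions: ratio-based similarity, weighted least since it is partly redundant.
    dims = (min(det_size[0], target_size[0]) / max(det_size[0], target_size[0]) *
            min(det_size[1], target_size[1]) / max(det_size[1], target_size[1]))
    return w[0] * closeness + w[1] * appearance + w[2] * dims
```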

3.3. Dataset Distribution

The VisDrone2018-DET dataset consists of 8599 images captured from the point of view of a UAV, with detailed annotations focusing on cars and pedestrians. It is split into 3 sub-datasets: training (6471 images), validation (548 images), and testing (1580 images). Its importance lies in the top-down appearance of the objects, which differs from the sideways appearance common in most datasets and can be a great obstacle to the detection process.
The VisDroneVDT-2018 dataset consists of 79 video clips (33,366 frames), with approximately 1.5 million annotations. It is split into training, validation, and test sets, which include 56, 7, and 16 clips, respectively. The annotations are similar to those in VisDrone2018-DET and also include object IDs that indicate the tracks of the annotated objects.

4. Results and Discussion

4.1. Onboard Model

As mentioned, YOLOv5n was selected as the onboard model. Weights pretrained on the COCO dataset were fine-tuned on the VisDrone2018-DET dataset, achieving a significant decrease in loss and better results, as shown in Figure 7 and Figure 8.
Due to the significant differences between the drone images and the images in the COCO dataset, it might be possible to achieve similar or better results by training the model from scratch. An attempt at this produced similar but slightly worse results; therefore, the fine-tuned model was eventually selected. The tested models were trained using the default hyperparameter settings provided by the Ultralytics implementation [10], leading to an increase of about 100% in the precision scores on the validation set of VisDrone2018-DET, as seen in the graphs below.

4.2. Selection of the Server’s Detection Model

The detection results were evaluated based mostly on recall, which is defined as follows:
$$\mathrm{Recall} = \frac{\text{Number of Objects Correctly Detected}}{\text{Number of Objects Correctly Detected} + \text{Number of Objects Not Detected}}$$
This metric was chosen due to the nature of the problem. False positives were not the main concern, since specific objects needed to be detected and tracked; therefore, being able to “understand” those objects as valid detections was more important. Of course, a very high number of false positive detections might introduce some difficulty in the repositioning of the target and the tracking process. Therefore, the accuracy metric, defined as
$$\mathrm{Accuracy} = \frac{\text{Number of Objects Correctly Detected}}{\text{Number of Objects Correctly Detected} + \text{Number of False Detections}}$$
was also examined, though it was only a concern when extremely low. As previously discussed, YOLOv8x and RT-DETR were the best-performing models from the convolutional and Transformer-based families, respectively. Their performance after training on VisDrone2018-DET can be seen in Table 2 and Table 3. While the results are similar (when the appropriate confidence threshold is chosen), inference with YOLOv8x was performed in around 14 ms, while RT-DETR needed around 34–35 ms. Therefore, YOLOv8x was chosen for the final implementation. As with the onboard model, the YOLOv8x and RT-DETR models were trained using the default hyperparameters provided by their Ultralytics implementations [14,19].

4.3. Tracking Prediction Model

For training the LSTM model, the Adam optimizer included in torch.optim was used, along with a mean squared error loss function. Several experiments were run with the remaining training hyperparameters and the data passed to the model. Specifically, various batch sizes (16, 32, and 64), learning rates (0.001, 0.0005, 0.0002, 2 × 10⁻⁵, 10⁻⁵, 5 × 10⁻⁶, and decaying from 0.001 to 10⁻⁵), scaled and unscaled data, and different neighbor configurations (no neighbors, closest 4 neighbors, closest 4 neighbors with a distance limit, and 12 neighbors) were tested. In all cases, the size of the validation set was 20% of the total, and the model was trained for 150 epochs (100 epochs after scaling the data). The following observations were made (a minimal training-loop sketch is given after the list):
  • Number of neighbors: The different alternatives described above did not make a significant difference, with the exception of the total lack of neighbor information, which resulted in slightly worse results (RMSE > 8 on the validation set, with a decaying learning rate, batch size = 32, and scaled data, while the other configurations achieved an RMSE of around 6.5–7).
  • Batch size: The batch sizes tested did not seem to have a significant effect on the final results. For a batch size of 32, it seemed like convergence was achieved a few epochs earlier.
  • Data Scaling: Scaling the data caused a significant improvement in the results. Min-max scaling was used, which caused the RMSE to drop from around 27 to 13 with learning rate = 0.0001 and batch size = 32.
  • Learning Rate: The learning rate configuration was also of key importance. With learning rates above 0.0005, the model could not converge to a stable state, and with very low learning rates, it converged to local minima, leading to increased errors. The solution to this was the introduction of a decaying learning rate, decreasing from 0.001 to 10⁻⁵, which resulted in a final validation loss of about 6.5–7 with batch size = 32, four neighbors, and scaled data.
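A minimal sketch of the training configuration described above is given below (Adam, MSE loss, batch size 32, and a learning rate decaying from 0.001 to 10⁻⁵); the exponential decay schedule is one possible way to realize the decay, and the data loading and min-max scaling are assumed to happen beforehand.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100):
    """Training loop for the trajectory LSTM (data assumed min-max scaled beforehand)."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # One way to decay the learning rate from 1e-3 to 1e-5 over the training run.
    gamma = (1e-5 / 1e-3) ** (1.0 / epochs)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for past, future in train_loader:        # (batch, 10, feat) -> (batch, 6, 2)
            optimizer.zero_grad()
            loss = criterion(model(past), future)
            loss.backward()
            optimizer.step()
            running += loss.item()
        scheduler.step()
        print(f"epoch {epoch + 1}: loss {running / len(train_loader):.4f}")
```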
Some of the above results can be seen in Figure 9 and Figure 10. The presented error values were calculated between the six-step prediction vectors and the corresponding ground-truth ones.

4.4. An Indication of the Server’s Contribution

A simulation was run on 145 targets from 46 videos from the UAVDT Benchmark dataset [52], on which none of the models had been trained, in order to produce an approximate measurement of the contribution of the remote server. The purpose of this simulation was to estimate the percentage of target losses that can be prevented by using the proposed system. It was run using a reliable connection of sufficient speed and, in order to create realistic conditions, the frames sent to the server were one-third of the total received by the onboard system, with one-quarter of the original image size. A correction of the target’s location was considered successful when it extended the duration of its tracking for at least 30 frames or until it exited the UAV’s field of view. It was found that the onboard system lost 50 of the 145 targets at some point during the process of tracking, and 35 of them (70%) were correctly re-identified by the server.
Using the results of this simulation, estimations could be made for the ML (mostly lost: the percentage of objects whose trajectories were correctly tracked for less than 20% of their actual length) and MT (mostly tracked: the percentage of objects whose trajectories were correctly tracked for at least 80% of their actual length) metrics, as presented in [52]. In order to perform this estimation, only objects that were set as targets from their first appearance in the frame were taken into consideration, while others that were targeted randomly after the previous target was lost or exited the frame were discarded. This resulted in 105 valid targets, of which 80 were mostly tracked and 7 were mostly lost, yielding MT and ML scores of 76% and 7%, respectively. This performance is promising, but although these results might seem extremely positive, a direct comparison to the results presented in works such as [52] cannot be made, since those are based solely on tracking, without distinctly considering deep features, trajectory prediction, or re-identification of the tracked objects.
The performance of the remote server was, as expected, significantly better than that of the onboard system. However, due to the lower resolution of the received frames and the reduced frequency, it occasionally encountered difficulties in the case of extremely small targets, especially when they resembled their surroundings. It was found that such targets were sometimes detected for longer intervals by the onboard system when its confidence threshold was low enough; however, such cases were rare and probably unrealistic, since for the purposes of the simulation, this threshold was set extremely low (lower than it should be in real conditions), causing multiple false detections. The threshold was set this way in order to avoid frequently losing targets and obtaining a false high percentage of re-identifications, something that could be amplified by the fact that connection issues that will have to be dealt with in real applications were absent in the simulation. Some short videos of the tracking and re-identification process and the testing of the real system can be found in [53]. Examples of a success and a failure of the correction system can be seen in Figure 11 and Figure 12.

4.5. Model Inference Times and Overall Temporal Performance

Regarding the tasks of the Raspberry Pi, the inference of the detection model required approximately 50 ms, while each iteration of the tracking process required less than 10 ms. The entire onboard system operated at approximately 20 frames per second. On the server, detection using YOLOv8x was performed in less than 15 ms, the tracking prediction results were produced in around 20 ms, and the object tracking required less than 10 ms for each frame. However, since communication between the UAV and the server is not ideal, the server is expected to function at about 15 frames per second.

4.6. Overview of Results and Contributions

The proposed system is a result of attempts to address known challenges in real-time edge computation on UAVs, with an application in high-performance detection and tracking tasks. The main contribution of this paper is the introduction of a hardware and software architecture capable of (1) increasing the probability of correct target position estimation by combining the results of two detection models—onboard and remote—with different capabilities, (2) using trajectory prediction to achieve the above combination in the case of communication delays, and (3) performing re-estimation of lost target location using different metrics, namely the distance from the predicted trajectory, deep feature similarity, and dimension similarity.
During the design of the proposed system, several difficulties were encountered that resulted in useful findings worth mentioning. For example, the computational and real-time operation constraints resulted in the model selection presented above. Additionally, the unusual point of view of the processed images made it necessary to use a specialized dataset, leading to the previously discussed improvements. The significant assistance provided by the trajectory prediction model, as well as its ability to consider neighboring vehicles despite its simplicity, should also be noted. Another interesting result is the target position re-estimation by the server, along with the guiding instructions it provides to the UAV, which is a key indication of the performance of the system. An approximate example of the effect of these instructions can be observed in the results of the presented simulation. The most significant restriction that might obstruct the functioning of the system is the communication quality requirement. The issue most likely to arise as a result of this restriction is a very low frame rate for the server. Due to the higher performance of the server models, the system is relatively tolerant of this problem; of course, in extreme cases, it will be unable to function. Therefore, for real-world applications with problematic connection capabilities, an autonomous mode is implemented, which allows the UAV to operate based solely on its own detection and tracking algorithm.

5. Conclusions

The proposed architecture can provide significant assistance in the guidance of a UAV by combining low-power local processing with more complex remote analysis of the situation. The simple detection and tracking procedure performed on the UAV enables it to navigate automatically, while a high-performance server provides instructions that can prevent or correct incorrect decisions, especially in more complex situations. The tasks assigned to this system can be adapted to various scenarios through proper model training (e.g., for detecting animals or rescue team personnel) and hardware adaptations, such as using thermal cameras for operation at night or in other challenging conditions.
A Raspberry Pi was used for processing on the UAV, along with an NCS2 device, which allows the object detection models to run despite the restricted available computational resources. This local processing involves detecting objects of interest using a YOLOv5n model trained on the VisDrone2018-DET dataset and tracking an indicated target using the OpenCV KCF tracker. The local system is also responsible for directing the UAV appropriately, sending the necessary Mavlink commands to the autopilot. The Raspberry Pi adjusts its estimation of the target based on directions given by the ground server, provided a reliable internet connection is maintained.
The ground server performs more accurate and reliable object detection using a YOLOv8x model, which is also trained on the aforementioned dataset. It is also responsible for predicting the future trajectory of the target and relocating it when necessary, using a combination of tracking prediction and feature vectors extracted from the detected objects while providing guiding directions about the target’s location to the UAV.
An important direction for a future extension of this work is the resolution of practical issues that obstruct the proper functioning of the system. For example, different onboard computation and transmission hardware could potentially allow for significant performance improvements. Significant progress can also be made regarding the objectives of the system. Instead of tracking and following objects, more complex missions could be assigned to the system, requiring some form of semantic analysis of the scenes, which could also allow reliable predictions about their future development, leading to better complex decision making. Improvements to the proposed architecture can also be made by integrating targeted methods or tools, such as those presented in [54,55,56], which could enhance the estimations made by the system, as well as the combination of the results of the processes running on the UAV and the server.

Author Contributions

Conceptualization, O.N. and C.P.; methodology, C.P., O.N. and E.M.; software, O.N.; validation, C.P., O.N. and E.M.; formal analysis, O.N. and E.M.; investigation, O.N.; resources, O.N.; writing—original draft preparation, O.N. and E.M.; writing—review and editing, E.M. and C.P.; visualization, O.N.; supervision, C.P. and P.T.; project administration, C.P. and P.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Author Christos Pavlatos was employed by the company DDTech. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DETR	DEtection TRansformer
RT-DETR	Real-Time DEtection TRansformer
YOLO	You Only Look Once
NCS	Neural Compute Stick
CNN	Convolutional Neural Network
R-CNN	Region-based Convolutional Neural Network
UAV	Unmanned Aerial Vehicle
LSTM	Long Short-Term Memory
RNN	Recurrent Neural Network
KCF	Kernelized Correlation Filter
SORT	Simple Online and Realtime Tracking
HAT	Hardware Attached on Top
RMSE	Root Mean Square Error
FPN	Feature Pyramid Network
CSPNet	Cross-Stage Partial Network
SPP	Spatial Pyramid Pooling
SPPF	Spatial Pyramid Pooling—Fast
BCE Loss	Binary Cross-Entropy Loss
CIoU Loss	Complete IoU Loss
IoU	Intersection over Union
E-ELAN	Extended Efficient Layer Aggregation Network
NMS	Non-Maximum Suppression
AIFI	Attention-based Intra-scale Feature Interaction
CCFF	CNN-based Cross-scale Feature Fusion
API	Application Programming Interface
MOSSE	Minimum Output Sum of Squared Error

References

  1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef]
  3. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  4. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  5. Lin, T.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  6. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  7. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [Google Scholar] [CrossRef]
  9. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  10. Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 12 June 2024). [CrossRef]
  11. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  12. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  13. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  14. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 June 2024).
  15. Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  16. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Newry, UK, 2017; Volume 30. [Google Scholar]
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  19. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2023; pp. 16965–16974. [Google Scholar]
  20. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  21. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596. [Google Scholar] [CrossRef]
  22. Scholkopf, B.; Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; Adaptive Computation and Machine Learning series; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  23. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar] [CrossRef]
  24. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.T.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  25. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. 1955, 52, 83–97. [Google Scholar] [CrossRef]
  26. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  27. Wang, C.; Wang, J.; Zhang, X.; Zhang, X. Autonomous navigation of UAV in large-scale unknown complex environment with deep reinforcement learning. In Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, QC, Canada, 14–16 November 2017; pp. 858–862. [Google Scholar] [CrossRef]
  28. Imanberdiyev, N.; Fu, C.; Kayacan, E.; Chen, I.M. Autonomous navigation of UAV by using real-time model-based reinforcement learning. In Proceedings of the 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), Phuket, Thailand, 13–15 November 2016; pp. 1–6. [Google Scholar] [CrossRef]
  29. Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous Navigation of UAVs in Large-Scale Complex Environments: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
  30. Zhang, S.; Li, Y.; Dong, Q. Autonomous navigation of UAV in multi-obstacle environments based on a Deep Reinforcement Learning approach. Appl. Soft Comput. 2022, 115, 108194. [Google Scholar] [CrossRef]
  31. Cui, J.Q.; Lai, S.; Dong, X.; Liu, P.; Chen, B.M.; Lee, T.H. Autonomous navigation of UAV in forest. In Proceedings of the 2014 International Conference on Unmanned Aircraft Systems (ICUAS), Orlando, FL, USA, 27–30 May 2014; pp. 726–733. [Google Scholar] [CrossRef]
  32. Aguilar, W.G.; Salcedo, V.S.; Sandoval, D.S.; Cobeña, B. Developing of a Video-Based Model for UAV Autonomous Navigation. In Proceedings of the Computational Neuroscience, Porto Alegre, Brazil, 22–24 November 2017; Barone, D.A.C., Teles, E.O., Brackmann, C.P., Eds.; Springer: Cham, Switzerland, 2017; pp. 94–105. [Google Scholar]
  33. Li, Y.; Zhang, W.; Li, P.; Ning, Y.; Suo, C. A Method for Autonomous Navigation and Positioning of UAV Based on Electric Field Array Detection. Sensors 2021, 21, 1146. [Google Scholar] [CrossRef]
  34. Kouris, A.; Bouganis, C.S. Learning to Fly by MySelf: A Self-Supervised CNN-Based Approach for Autonomous Navigation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–9. [Google Scholar] [CrossRef]
  35. Mansouri, S.S.; Kanellakis, C.; Kominiak, D.; Nikolakopoulos, G. Deploying MAVs for autonomous navigation in dark underground mine environments. Robot. Auton. Syst. 2020, 126, 103472. [Google Scholar] [CrossRef]
  36. Pfeiffer, M.; Schaeuble, M.; Nieto, J.; Siegwart, R.; Cadena, C. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1527–1533. [Google Scholar] [CrossRef]
  37. Menshchikov, A.; Shadrin, D.; Prutyanov, V.; Lopatkin, D.; Sosnin, S.; Tsykunov, E.; Iakovlev, E.; Somov, A. Real-Time Detection of Hogweed: UAV Platform Empowered by Deep Learning. IEEE Trans. Comput. 2021, 70, 1175–1188. [Google Scholar] [CrossRef]
  38. Speth, S.; Alves Gonçalves, A.; Rigault, B.; Suzuki, S.; Bouazizi, M.; Matsuo, Y.; Prendinger, H. Deep learning with RGB and thermal images onboard a drone for monitoring operations. J. Field Robot. 2022, 39, 840–868. [Google Scholar] [CrossRef]
  39. Ayoub, N.; Schneider-Kamp, P. Real-Time On-Board Deep Learning Fault Detection for Autonomous UAV Inspections. Electronics 2021, 10, 1091. [Google Scholar] [CrossRef]
  40. Amato, G.; Ciampi, L.; Falchi, F.; Gennaro, C. Counting Vehicles with Deep Learning in Onboard UAV Imagery. In Proceedings of the 2019 IEEE Symposium on Computers and Communications (ISCC), Barcelona, Spain, 29 June–3 July 2019; pp. 1–6. [Google Scholar] [CrossRef]
  41. Lee, M.H.; Yeom, S. Detection and Tracking of Multiple Moving Vehicles with a UAV. Int. J. Fuzzy Log. Intell. Syst. 2018, 18, 182–189. [Google Scholar] [CrossRef]
  42. Wu, Y.; Sui, Y.; Wang, G. Vision-Based Real-Time Aerial Object Localization and Tracking for UAV Sensing System. IEEE Access 2017, 5, 23969–23978. [Google Scholar] [CrossRef]
  43. Balamuralidhar, N.; Tilon, S.; Nex, F. MultEYE: Monitoring System for Real-Time Vehicle Detection, Tracking and Speed Estimation from UAV Imagery on Edge-Computing Platforms. Remote Sens. 2021, 13, 573. [Google Scholar] [CrossRef]
  44. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar] [CrossRef]
  45. Lee, J.; Wang, J.; Crandall, D.; Šabanović, S.; Fox, G. Real-Time, Cloud-Based Object Detection for Unmanned Aerial Vehicles. In Proceedings of the 2017 First IEEE International Conference on Robotic Computing (IRC), Taichung, Taiwan, 10–12 April 2017; pp. 36–43. [Google Scholar] [CrossRef]
  46. Zhang, Z.; Lin, W.Y.; Torr, P. BING: Binarized Normed Gradients for Objectness Estimation at 300fps. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3286–3293. [Google Scholar] [CrossRef]
  47. Alam, M.S.; Natesha, B.V.; Ashwin, T.S.; Guddeti, R.M.R. UAV based cost-effective real-time abnormal event detection using edge computing. Multimed. Tools Appl. 2019, 78, 35119–35134. [Google Scholar] [CrossRef]
  48. Wang, S.; Jiang, F.; Zhang, B.; Ma, R.; Hao, Q. Development of UAV-Based Target Tracking and Recognition Systems. IEEE Trans. Intell. Transp. Syst. 2020, 21, 3409–3422. [Google Scholar] [CrossRef]
  49. Hossain, S.; Lee, D.J. Deep Learning-Based Real-Time Multiple-Object Detection and Tracking from Aerial Imagery via a Flying Robot with GPU-Based Embedded Devices. Sensors 2019, 19, 3371. [Google Scholar] [CrossRef]
  50. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Nie, Q.; Cheng, H.; Liu, C.; Liu, X.; et al. VisDrone-DET2018: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Proceedings, Part V. Springer: Berlin/Heidelberg, Germany, 2019; pp. 437–468. [Google Scholar] [CrossRef]
  51. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Wu, H.; Nie, Q.; Cheng, H.; Liu, C.; et al. VisDrone-VDT2018: The Vision Meets Drone Video Detection and Tracking Challenge Results. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Proceedings, Part V. Springer: Berlin/Heidelberg, Germany, 2019; pp. 496–518. [Google Scholar] [CrossRef]
  52. Du, D.; Qi, Y.; Yu, H.; Yang, Y.F.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  53. Available online: https://tinyurl.com/uavguidancesample (accessed on 26 December 2024).
  54. Liu, W.; Wang, G.; Sun, J.; Bullo, F.; Chen, J. Learning Robust Data-Based LQG Controllers From Noisy Data. IEEE Trans. Autom. Control 2024, 69, 8526–8538. [Google Scholar] [CrossRef]
  55. Li, W.; Qin, K.; Li, G.; Shi, M.; Zhang, X. Robust bipartite tracking consensus of multi-agent systems via neural network combined with extended high-gain observer. ISA Trans. 2023, 136, 31–45. [Google Scholar] [CrossRef]
  56. Shi, L.; Ma, Z.; Yan, S.; Zhou, Y. Cucker-Smale Flocking Behavior for Multiagent Networks With Coopetition Interactions and Communication Delays. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 5824–5833. [Google Scholar] [CrossRef]
Figure 1. The Transformer model architecture as presented in [17].
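For reference, the central operation of the Transformer architecture shown in Figure 1 is scaled dot-product attention, which, for queries Q, keys K, values V, and key dimension d_k, takes the standard form:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]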
Figure 2. The structure of RT-DETR as presented in [19].
Figure 3. Representation of an LSTM cell structure.
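For completeness, the gates and state updates of a standard LSTM cell such as the one depicted in Figure 3 are given below (generic formulation; the layer sizes used in this work are not implied):

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]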
Figure 5. An overview of the hardware of the proposed architecture.
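Because the onboard YOLOv5n detector runs on the Intel NCS2 rather than on a GPU, the model first has to be converted to an OpenVINO representation. The sketch below assumes the official YOLOv5 export script and the OpenVINO runtime, in which the NCS2 is exposed as the "MYRIAD" device (supported up to OpenVINO 2022.3); file names and the input resolution are placeholders, not the exact pipeline used onboard.

```python
# Sketch: run a YOLOv5n model exported to OpenVINO IR on the Intel NCS2.
# Export (done once, on a development machine) with the official YOLOv5 repo:
#   python export.py --weights yolov5n.pt --include openvino
# File paths, input size, and device availability are assumptions.
import cv2
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("yolov5n_openvino_model/yolov5n.xml")  # assumed export path
compiled = core.compile_model(model, "MYRIAD")                  # NCS2 device name in OpenVINO <= 2022.3

frame = cv2.imread("frame.jpg")                                 # assumed input frame
resized = cv2.cvtColor(cv2.resize(frame, (640, 640)), cv2.COLOR_BGR2RGB)
blob = resized.transpose(2, 0, 1)[None].astype(np.float32) / 255.0
raw = compiled([blob])[compiled.output(0)]                      # raw predictions; NMS/post-processing omitted
print(raw.shape)
```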
Figure 6. An overview of the software of the proposed architecture.
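The software pipeline in Figure 6 includes a step in which selected frames are transmitted from the onboard computer to the ground server. As a purely illustrative sketch (the actual transport, frame-selection rule, host, and port used in this work are not specified here), such an offloading loop could look as follows:

```python
# Illustrative sketch of offloading selected frames to the ground server over
# TCP. SERVER_HOST/SERVER_PORT and the selection rule are placeholders, not
# the values or logic used in the paper.
import socket
import struct
import cv2

SERVER_HOST, SERVER_PORT = "192.168.1.10", 5000    # assumed address of the ground server

def send_frame(sock: socket.socket, frame) -> None:
    ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
    if not ok:
        return
    payload = jpeg.tobytes()
    sock.sendall(struct.pack("!I", len(payload)))  # 4-byte length prefix, then the JPEG bytes
    sock.sendall(payload)

cap = cv2.VideoCapture(0)                          # onboard camera
with socket.create_connection((SERVER_HOST, SERVER_PORT)) as sock:
    frame_idx = 0
    while True:
        grabbed, frame = cap.read()
        if not grabbed:
            break
        if frame_idx % 10 == 0:                    # placeholder "selected frames" rule
            send_frame(sock, frame)
        frame_idx += 1
```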
Figure 7. Losses and precision scores during the fine-tuning of the pretrained weights (validation set). The blue lines show the exact values, and the yellow dots represent the corresponding smoothed curves.
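The fine-tuning run summarized in Figure 7 can be reproduced in outline with the Ultralytics training API; the dataset YAML, epoch count, batch size, and image size below are placeholder assumptions rather than the exact settings used to produce these curves.

```python
# Sketch of fine-tuning a pretrained YOLOv8x checkpoint on a custom dataset
# with the Ultralytics API; "visdrone.yaml", epochs, batch, and imgsz are
# assumed values, not the paper's configuration.
from ultralytics import YOLO

model = YOLO("yolov8x.pt")          # pretrained weights
model.train(
    data="visdrone.yaml",           # assumed dataset definition (train/val paths, class names)
    epochs=50,
    imgsz=1280,
    batch=8,
)
metrics = model.val()               # precision/recall on the validation set after training
print(metrics.box.mp, metrics.box.mr)
```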
Figure 8. Results indicating the effects of custom model training. (a) Detection results before fine-tuning the model. (b) Detection results after fine-tuning the model.
Figure 9. RMSE values for training with unscaled data and batch size = 32. (a) RMSE for training performed with learning rate = 0.001, resulting in the model being unable to converge. (b) RMSE for training with learning rate = 0.0001, resulting in the model converging; however, due to the lack of scaling, the final RMSE is close to 27.
Figure 10. RMSE values for training with scaled data and batch size = 32. (a) RMSE for training without neighbor information. (b) RMSE for training with information on the position of 4 neighbors.
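Figures 9 and 10 contrast training on raw and on scaled trajectory data. The sketch below illustrates the scaled setup: coordinates are normalized with a min-max scaler before being windowed and fed to an LSTM, and RMSE is computed on the predictions. Layer width, window length, epoch count, and the synthetic trajectory are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch: scale 2-D trajectory points to [0, 1] before LSTM
# training and report RMSE. Window length, layer width, and epochs are
# assumed values.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf

def make_windows(track: np.ndarray, window: int = 8):
    """Turn a trajectory into (window -> next point) training pairs."""
    X, y = [], []
    for i in range(len(track) - window):
        X.append(track[i:i + window])
        y.append(track[i + window])
    return np.array(X), np.array(y)

track = np.cumsum(np.random.randn(500, 2), axis=0)     # stand-in (x, y) trajectory
scaler = MinMaxScaler()
scaled = scaler.fit_transform(track)                   # scaling step contrasted in Figures 9/10
X, y = make_windows(scaled)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=X.shape[1:]),
    tf.keras.layers.Dense(2),                          # predicted next (x, y) position
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
model.fit(X, y, batch_size=32, epochs=20, verbose=0)

pred = model.predict(X, verbose=0)
rmse = np.sqrt(mean_squared_error(y, pred))            # RMSE in scaled units
print(rmse)
```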
Figure 11. Example of a case where the onboard system failed (the target, represented by a red box, passed behind the street sign) and the server corrected it. A different run of the same simulation can be found in [53], where the onboard system failure appears earlier.
Figure 12. Example of a case where the server could not re-identify the lost target (represented by a red box).
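The server-side stage that attempts such re-identifications couples YOLOv8x detections with DeepSORT tracking. A minimal sketch is given below, assuming the Ultralytics and deep-sort-realtime packages (the exact DeepSORT implementation used in this work is not implied); max_age and the video source are placeholders.

```python
# Sketch of the server-side detect-and-track loop: YOLOv8x detections fed
# into DeepSORT. Packages, max_age, and paths are assumptions.
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YOLO("yolov8x.pt")
tracker = DeepSort(max_age=30)                    # frames to keep a lost track alive

cap = cv2.VideoCapture("uav_stream.mp4")          # assumed source of frames sent by the UAV
while True:
    grabbed, frame = cap.read()
    if not grabbed:
        break
    result = detector(frame, verbose=False)[0]
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf[0])
        cls = int(box.cls[0])
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls))  # (ltwh, confidence, class)
    tracks = tracker.update_tracks(detections, frame=frame)
    for track in tracks:
        if not track.is_confirmed():
            continue
        print(track.track_id, track.to_ltrb())    # persistent IDs allow re-identifying lost targets
```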
Table 1. Significant characteristics of various YOLO models.

Model | Significant Modifications
YOLOv2 [3] | Darknet-19 network; anchor boxes; k-means on training boxes to determine initial box coordinates; batch normalization; increased classifier resolution; added passthrough layer for detection of more detailed features; multiscale training; direct box location prediction
YOLOv3 [4] | Darknet-53 network; predictions across 3 scales with an FPN [5]-like mechanism; multi-label classification
YOLOv4 [6] | CSPDarknet53 network [7]; SPP block [8]; PANet path aggregation network [9]
YOLOv5 [10] | SPPF structure; updated box coordinate prediction formula; training loss as a weighted sum of class loss (BCE), objectness loss (BCE), and location loss (CIoU)
YOLOX [11] | Decoupled head for separate classification, box localization, and objectness prediction; anchor-free box detection
YOLOv7 [12] | E-ELAN backbone block; lead + auxiliary head for output and deeply supervised training, respectively; planned re-parametrization
YOLOv6 [13] | EfficientRep backbone; enhancements in neck structure (Rep-PAN) and head (Efficient Decoupled Head)
YOLOv8 [14] | CSPDarknet53 backbone; C2f module instead of FPN (combination of features of various levels)
YOLOv9 [15] | Generalized Efficient Layer Aggregation Network (GELAN); programmable gradient information
YOLOv10 [16] | Dual-label assignment to avoid NMS post-processing
Table 2. YOLOv8x training results.

Image Size | Confidence Threshold | Pretrained Recall | Pretrained Accuracy | New Recall | New Accuracy
1442 × 856 | 0.3 | 0.22 | 0.67 | 0.53 | 0.74
481 × 285 | 0.3 | 0.15 | 0.58 | 0.46 | 0.73
1442 × 856 | 0.1 | 0.31 | 0.51 | 0.58 | 0.55
481 × 285 | 0.1 | 0.23 | 0.47 | 0.50 | 0.57
1442 × 856 | 0.05 | 0.36 | 0.41 | 0.59 | 0.45
481 × 285 | 0.05 | 0.26 | 0.38 | 0.52 | 0.46
Table 3. RT-DETR training results.

Image Size | Confidence Threshold | Pretrained Recall | Pretrained Accuracy | New Recall | New Accuracy
1442 × 856 | 0.3 | 0.34 | 0.52 | 0.60 | 0.57
481 × 285 | 0.3 | 0.25 | 0.50 | 0.50 | 0.60
1442 × 856 | 0.2 | 0.41 | 0.34 | 0.63 | 0.39
481 × 285 | 0.2 | 0.32 | 0.33 | 0.53 | 0.43
1442 × 856 | 0.1 | 0.49 | 0.15 | 0.66 | 0.19
481 × 285 | 0.1 | 0.40 | 0.15 | 0.57 | 0.20
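Tables 2 and 3 report each model at two image sizes and several confidence thresholds. A sweep of this kind can be sketched with the Ultralytics validation API as below; the dataset YAML and weight files are assumptions, and the precision/recall reported by val() may not coincide exactly with the "Accuracy" and "Recall" definitions in the tables, so this is an illustrative loop rather than the paper's scoring script.

```python
# Illustrative evaluation sweep over confidence thresholds and image sizes
# for YOLOv8x and RT-DETR using the Ultralytics validation API. Dataset YAML
# and weight files are assumptions; metric definitions may differ from the tables.
from ultralytics import YOLO, RTDETR

models = {
    "YOLOv8x": YOLO("yolov8x_finetuned.pt"),      # assumed fine-tuned weights
    "RT-DETR": RTDETR("rtdetr_finetuned.pt"),
}
thresholds = {
    "YOLOv8x": [0.3, 0.1, 0.05],                  # thresholds listed in Table 2
    "RT-DETR": [0.3, 0.2, 0.1],                   # thresholds listed in Table 3
}
image_sizes = [1442, 481]                         # long-side sizes approximating the tables

for name, model in models.items():
    for imgsz in image_sizes:
        for conf in thresholds[name]:
            metrics = model.val(data="visdrone.yaml", imgsz=imgsz, conf=conf, verbose=False)
            print(name, imgsz, conf, metrics.box.mr, metrics.box.mp)  # mean recall, mean precision
```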