1. Introduction
Airborne data is essential for accurately evaluating open environments, and unmanned aerial vehicles (UAVs) have become vital tools for providing this information quickly and efficiently. With advancements in UAV technology, particularly in autonomous operations, there is an increasing demand for higher-quality, broader, and more detailed sensor information, which can significantly impact critical decision-making in various scenarios. This paper focuses on one key autonomous capability: vision-based mobile target tracking and following using UAVs. Tracking and following moving targets from the air has diverse applications, including search and rescue [1], police pursuits [2], and vehicle monitoring [3]. Additionally, this capability facilitates enhanced collaboration between aerial and ground robot systems, such as enabling UAVs to land on ground-based vehicles for recharging or other logistical support [4], or for inspection of moving structures [5]. In such dynamic environments, where multiple targets may be visible, it is crucial for UAV systems to not only track the primary target but also maintain awareness of other relevant objects in the field of view.
The problem of vision-based mobile target tracking (MTT) and following encompasses the detection of the target, followed by a target tracking algorithm that monitors the position of the target over time in the image. Finally, the position of the target in relation to the UAV is computed to give feedback to the controller and allow target following. Currently, there is a divide between two different strategies that perform target tracking in the image: single-object tracking (SOT) approaches, where only the target of interest is detected and tracked in the image, and multi-object tracking (MOT) approaches, where the system detects and tracks all the targets of interest in the image and assigns a specific ID to each of them.
Many works [6,7,8,9,10] have been developed in this field; however, the most accepted approach is to tackle the detection and tracking step of the problem by using SOT, namely the kernelized correlation filter (KCF) algorithm [11]. Correlation-based trackers such as KCF propose one-shot learning and show good performance without GPU acceleration, which makes them very appealing for embedded systems with computational limitations [12]. However, KCF relies on the appearance model of the target being tracked. When occlusions occur, significant changes in target appearance can hinder accurate tracking, potentially leading to tracking failures. Hence, the KCF method is very susceptible to partial target occlusions and can only track the target in the image after it is selected in the first frame, which is usually performed manually [6,7,10]. To partially tackle the problem of recovery after occlusion, an algorithm was developed in [7] that analyses the motion between frames to detect movement indicative of the target. However, this approach can be susceptible to noise and dynamic environments. Another common approach is to use the Kalman filter (KF) in conjunction with the KCF [9,10]. Additionally, deep learning methods (YOLOv3) can be used to initialize the KCF tracker and redetect the target after a full occlusion, provided there are no similar targets in the image [9].
On the other hand, MOT methods struggle with the camera motion and view changes caused by UAV movement but work well in dynamic multi-target environments. Recently, some works have attempted to address this issue by improving camera motion models [13,14]. MOT applications have the advantage of working under a tracking-by-detection approach, which performs a detection step followed by a tracking step in every frame, instead of a single detection step at the beginning. The MOT approach allows for the consistent use of new deep learning-based detection methods (such as YOLOv8 [15]) in order to increase the reliability of the system as a whole. The accuracy of a tracking-by-detection approach using the previous YOLOv7 version and the state-of-the-art BoT-SORT algorithm has been demonstrated, showing an effective solution for target occlusion and identity switching in pedestrian target tracking, even under poor illumination conditions and complex scenes [16]. Furthermore, a similar system utilizing YOLOv8 and BoT-SORT within a synthetic aperture radar imaging framework has been shown to exhibit high precision in both detection and tracking, with real-time capabilities [17]. To the best of our knowledge, despite recent advancements and the growing application of MOT, there have been limited real-world implementations of these techniques for target following in UAV systems. This research aims to address this gap by demonstrating the potential of MOT for target following and showcasing its ability to manage dynamic scenarios, thereby overcoming some of the limitations associated with current SOT approaches. This approach can enable new applications where multi-target information is a valuable asset, such as easy conditional target switching, crowd following, or enhanced redetection methods.
YOLOv8 [18] was chosen for this work as it is one of the most recent additions to the YOLO family, released by Ultralytics on 10 January 2023; Ultralytics are also the original creators of YOLOv5, one of the most broadly used versions of the YOLO family. Moreover, the YOLO family is an ever-evolving field of research, with recent improvements shown in YOLOv9 [19] and YOLOv10 [20].
In addition to detection and tracking, accurately estimating the target’s position relative to the UAV is critical for effective target following. The discrepancy between the target’s pixel position and the centre of the image frame is often used to guide the UAV’s camera or gimbal toward achieving line-of-sight (LOS) with the target [6,7,8,9]. Several studies leverage this deviation to adjust the UAV’s orientation, either by steering a gimbal [6,7] or through lateral and vertical movements to maintain the target in focus [8,9]. Once LOS is established, maintaining a consistent distance between the UAV and the target becomes crucial. Distance estimation techniques, such as those based on the standard pinhole imaging model combined with extended Kalman filters (EKF), help manage observation noise and improve accuracy in dynamic scenarios [7]. Additionally, some methods adjust the UAV’s following speed by calculating the proportion of the target’s size in the image, providing another mechanism for precise target following [6,9].
Control algorithms for mobile target following vary across studies, from direct 3D position inputs [8,9] to the use of proportional navigation (PN) with cascade PID controllers [6]. Other works use a switchable tracking strategy based on estimated distance, transitioning between observing and following modes with different control methods [7]. Many existing distance estimation methods suffer from limitations in precise target positioning, highlighting the need for more accurate methods. Incorporating a depth sensing module, when feasible, presents a promising solution to address this issue, as suggested by [6].
This paper presents several novel contributions to the field of autonomous UAV-based target tracking and following systems. Firstly, we propose a vision-based system that leverages multi-target information for robust mobile target tracking and following. Secondly, we perform a comparative analysis of the state-of-the-art YOLOv8 object detection algorithm with leading multi-object tracking (MOT) methods: BoT-SORT and ByteTrack, evaluating their effectiveness in target following scenarios. Thirdly, we introduce a 3D flight control algorithm that utilizes RGB-D information to precisely follow designated targets. Fourthly, to address challenges like ID switches and partial/full occlusions in dynamic environments, we present a novel redetection algorithm that exploits the strengths of multi-target information. Finally, we comprehensively evaluate the proposed system through extensive simulations and real-world experiments, highlighting its capabilities and limitations.
This paper extends our previous work [21], which initially demonstrated the feasibility of our approach in simulation using the MRS-CTU system. Building on this foundation, the current research introduces several key advancements. We have refined the algorithms, leading to improved performance in target tracking and following. Additionally, we present a comparative analysis between the ByteTrack and BoT-SORT algorithms, which was not covered in the earlier study. Most importantly, we validate the system with real-world experiments, marking a substantial step forward from the simulation-based results of our prior work. These real-world results not only confirm the system’s effectiveness but also demonstrate its practical applicability and robustness in actual drone operations.
The rest of the work is organized as follows: Section 2 presents the proposed system, and Section 3 details the visual detection and tracking module. Section 4 explains the experimental setup for simulations and real-world tests. Section 5 shows the simulation results, and Section 6 the real-world experiments. Concluding remarks and future work are discussed in Section 7.
2. Proposed System
The system configuration is shown in Figure 1. The overall system can be split into four modules:
- 1. Visual detection and tracking module;
- 2. Distance estimation;
- 3. Following flight controller;
- 4. System mode switcher.
The “vision detection and MOT module” is responsible for detecting and tracking all objects of interest within the UAV’s field of view, including the designated target to follow. This module integrates three main components: the vision detection and MOT algorithm, which detects and tracks all objects in the image; the target acquisition stage, where a Kalman filter is employed to estimate the position of the selected target; and the redetection algorithm, which monitors the status of the target and attempts to redetect it in the case of occlusion or loss. The module processes RGB data from the UAV’s camera to differentiate between the primary target and other objects, such as bystanders, even during occlusions. When the redetection algorithm is triggered, a “redetection mode” signal is sent to the system mode switcher, which adjusts the control output accordingly.
The bounding box data produced by the vision detection and MOT module is then passed to the distance estimation stage, where the relative distance between the UAV and the target is calculated. This distance can also be determined using data from a depth sensor, such as an RGB-D camera, when available, particularly at larger distances where RGB data alone may be insufficient. Based on the position of the target relative to the image centre and the estimated distance, the controller computes the necessary yaw rate and velocity commands for the UAV. These commands are then relayed to the system mode switcher, which decides the appropriate inputs to send to the autopilot, depending on the current mode of the system. The autopilot controls the attitude of the UAV and sends the IMU data as input to the controller.
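To make the data flow between these modules concrete, the sketch below wires them together in a single per-frame loop. The module names and interfaces (`Detector`, `DistanceEstimator`, and so on) are purely illustrative assumptions, not the implementation used on the real platform.

```python
# Minimal per-frame pipeline sketch; module names and interfaces are illustrative,
# not the implementation used on the real UAV.
def process_frame(rgb, depth, imu, detector, estimator, controller, mode_switcher):
    # 1. Visual detection and MOT: bounding boxes, classes and track IDs
    tracks = detector.detect_and_track(rgb)

    # 2. Target acquisition / redetection: returns the target track (or None) and a mode flag
    target, mode = detector.update_target(tracks)

    # 3. Distance estimation from the bounding box, refined with depth when available
    distance = estimator.estimate(target, depth) if target is not None else None

    # 4. Following flight controller: yaw rate, vertical and forward velocity commands
    cmd = controller.compute(target, distance, imu)

    # 5. System mode switcher decides what is actually sent to the autopilot
    return mode_switcher.select(cmd, mode)
```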
The proposed system will be tested in complex scenarios involving multiple moving pedestrians, where challenges such as occlusions and disturbances are likely to arise. However, the system is designed to be versatile and can be adapted for different objects of interest by modifying the training dataset used for the object detector and adjusting the control parameters as needed. Additionally, alternative distance estimation methods can be explored and applied to other object types, ensuring the system’s effectiveness across a variety of applications.
3. Visual Detection and Tracking Module
The first module is the visual detection and tracking module, responsible for processing the images from the camera, identifying the target, and performing the necessary redetections.
3.1. Object Detection and Tracking
The first step in achieving target following involves detecting the target, as well as any bystanders, in the image and assigning each a unique identifier (ID) for continuous tracking. This is performed using a track-by-detection approach, where a real-time object detector identifies and classifies objects in each frame for subsequent tracking. For the detections, the system takes advantage of the state-of-the-art detector YOLOv8. The YOLOv8 model analyses the image in real time and provides a list of detections, highlighting all the objects in the image that are identified with the “person” class above a threshold confidence level. Each detection contains the bounding box surrounding the object, the class of the object (“person”, in this case) and the confidence score of that result. Examples of detections in simulation and in real-world scenarios are shown in Figure 2.
The detections (composed of the bounding boxes and classes) are then linked over time by the tracker, by attributing unique identifiers (IDs) to each object, as also shown in Figure 2. Two state-of-the-art trackers are considered: the BoT-SORT MOT algorithm and its predecessor, ByteTrack. YOLOv8 provides the detections via bounding boxes to the tracker, which in turn performs data association with the previous frame to match each detection with a corresponding ID.
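As an illustration of this detection-plus-tracking step, the sketch below uses the Ultralytics tracking interface; the model weights, confidence threshold and video source are placeholder assumptions rather than the configuration used in our experiments.

```python
# Sketch of the detection + tracking step using the Ultralytics API
# (model weights, thresholds and the video source are illustrative).
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any YOLOv8 variant trained with the "person" class

cap = cv2.VideoCapture(0)   # placeholder video source
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # persist=True keeps tracker state between frames; switch the tracker config
    # to "bytetrack.yaml" to compare ByteTrack against BoT-SORT.
    results = model.track(frame, persist=True, classes=[0], conf=0.5,
                          tracker="botsort.yaml", verbose=False)
    boxes = results[0].boxes
    if boxes.id is not None:  # IDs are only assigned to confirmed tracks
        for xyxy, track_id, score in zip(boxes.xyxy, boxes.id, boxes.conf):
            x1, y1, x2, y2 = map(int, xyxy)
            print(f"ID {int(track_id)}: box=({x1},{y1},{x2},{y2}) conf={float(score):.2f}")
```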
3.2. Target Acquisition and State Estimation
After the take-off and initialization of the detection and tracking algorithm, users can select a target to follow from the list of detected individuals. In this work, the algorithm automatically assigns the first detected person as the target to follow, storing their respective ID for tracking in subsequent frames. To fully leverage the capabilities of multi-object tracking (MOT), the positions of other detected individuals and their assigned IDs are also recorded to improve redetection in case the target is occluded.
After detection, a Kalman filter is used to predict the target’s position even during occlusions [10]. The prediction model employed is based on a constant velocity motion model [22], where the velocities are derived from the movement of the target’s bounding box centre in the image. The states used are the centre coordinates of the bounding box ($x_c$, $y_c$), the size of the bounding box (width $w$, height $h$) and the speed of movement in the image ($v_x$, $v_y$), calculated from the movement of the centre of the bounding box in pixel coordinates. The states are defined in Equation (1):

$$\mathbf{x} = \begin{bmatrix} x_c & y_c & w & h & v_x & v_y \end{bmatrix}^{T} \quad (1)$$

The observations are shown in Equation (2):

$$\mathbf{z} = \begin{bmatrix} x_c & y_c & w & h \end{bmatrix}^{T} \quad (2)$$
The constant velocity model assumes the pixel coordinates of the centre of the target ($x_c$, $y_c$) move in the image with constant speed ($v_x$, $v_y$) and direction. This allows the Kalman filter to predict the target’s position during temporary occlusions. When the target is detected, the Kalman filter is updated with new observations. If the target is not visible, the Kalman filter’s predictions are used by the controller to continue tracking, as the target is expected to reappear shortly. If the target does not reappear after a predetermined number of frames, the Kalman filter’s predictions are incorporated into the redetection process to evaluate potential candidates for re-identification.
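A minimal constant-velocity Kalman filter over this state vector is sketched below; the process and measurement noise values and the unit time step are illustrative tuning assumptions, not the values used on the platform.

```python
# Constant-velocity Kalman filter over the bounding box state [x_c, y_c, w, h, v_x, v_y].
# Noise covariances and the time step are illustrative assumptions.
import numpy as np

class BoxKalmanFilter:
    def __init__(self, dt=1.0):
        n = 6
        self.x = np.zeros((n, 1))     # state vector
        self.P = np.eye(n) * 10.0     # state covariance
        self.F = np.eye(n)            # constant-velocity transition
        self.F[0, 4] = dt             # x_c += v_x * dt
        self.F[1, 5] = dt             # y_c += v_y * dt
        self.H = np.zeros((4, n))     # we only observe [x_c, y_c, w, h]
        self.H[:4, :4] = np.eye(4)
        self.Q = np.eye(n) * 1e-2     # process noise (tuning parameter)
        self.R = np.eye(4) * 1.0      # measurement noise (tuning parameter)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4].ravel()     # predicted [x_c, y_c, w, h]

    def update(self, z):              # z = measured [x_c, y_c, w, h]
        z = np.asarray(z, dtype=float).reshape(4, 1)
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```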
3.3. Redetection Algorithm
In vision-based detection and tracking of mobile targets from UAVs, occlusions pose a significant challenge, especially in densely populated areas. Redetection algorithms are crucial for reliable operations, particularly in scenarios involving multiple individuals. In such systems, the availability of multi-target information is a valuable asset to enhance the redetection algorithms and prevent the loss of the target.
Starting from the beginning of the target following process, the algorithm verifies detections using the assigned IDs from the YOLOv8+BoT-SORT or YOLOv8+ByteTrack setups by matching the received detection IDs with the locally stored detection list IDs. If the assigned target ID is absent from the detections, three possible situations may arise:
- 1. Target ID Change: A limitation of MOT algorithms that occurs when the target is still visible in the image, but the MOT algorithm has assigned it a different identity than in the previous frame. This may happen due to sudden movements of the target or the UAV that cause the tracker to misidentify the target. Since the algorithm identifies the target via the assigned ID, an identity change would cause a system failure if left unattended. Hence, it is necessary to update the locally defined target ID with the new ID assigned by the MOT module for continuous target following. The redetection algorithm allows the system to keep accurate target tracking and following despite this MOT limitation.
- 2. Target Missing: Initiated when the target is no longer detected in the image, potentially due to occlusions or obstructions by other objects, and a counter is started. During this step, the system continues to follow an estimate of the position of the target given by the Kalman filter. The brief window where the Kalman filter is used allows for a more robust system that will not immediately stop for short occlusions, such as when two people cross paths. The Missing stage also adds some robustness to fast target movements that would cause the target to leave the image, by continuing the corrective manoeuvre even after the target leaves the field of view.
- 3. Target Lost: Declared after a set of consecutive frames where the target is absent, suggesting a potential loss or concealment. To prevent further deviation from the hidden target, the UAV will stop and hover, looking for potential redetection candidates until the target is redetected. In this stage, a relocation strategy could also be considered in future work to position the UAV with a better view for redetection.
To perform the redetection process, the algorithm searches all detections for possible redetection candidates. Firstly, if the target ID is not found, it will attempt to check whether a “Target ID Change” has occurred. If no match is found, it will enter the “Target Missing” mode and, later, the “Target Lost” mode.
A viable candidate for a successful redetection must fulfil the following conditions:
- 1. It must represent a new detection not previously tracked, thereby excluding individuals already accounted for, as they cannot be the target. This effectively excludes all the bystanders in the area and allows for redetections even in dynamic areas.
- 2. Candidate detections are evaluated based on a minimum interception threshold with the latest estimated position of the missing target, determined using an IoU approach (Equation (3)). The bounding boxes are scaled to allow for a greater recovery range.
- 3. Among all the candidates that fulfil step 1 and step 2, it must be the one that scored the highest in step 2.
The IoU approach is a common evaluation method to determine the similarity between two bounding boxes in an image. It is given by the relation in Equation (3), where $B_1$ and $B_2$ are the respective bounding boxes, which can be visualized in Figure 3:

$$\mathrm{IoU}(B_1, B_2) = \frac{\lvert B_1 \cap B_2 \rvert}{\lvert B_1 \cup B_2 \rvert} \quad (3)$$
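A possible implementation of this candidate evaluation is sketched below; the box scaling factor, the minimum-overlap threshold and the helper names are illustrative assumptions rather than the tuned values of the system.

```python
# Candidate evaluation for redetection; scale factor and IoU threshold are illustrative.
def scale_box(box, factor=1.5):
    """Grow a (x_c, y_c, w, h) box around its centre to widen the recovery range."""
    x_c, y_c, w, h = box
    return (x_c, y_c, w * factor, h * factor)

def iou(box_a, box_b):
    """Intersection over union of two (x_c, y_c, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def best_redetection_candidate(detections, known_ids, predicted_box, min_iou=0.3):
    """Conditions 1-3: new ID, sufficient overlap with the Kalman prediction, highest score."""
    best_id, best_score = None, min_iou
    for track_id, box in detections.items():
        if track_id in known_ids:                              # condition 1: skip known bystanders
            continue
        score = iou(scale_box(box), scale_box(predicted_box))  # condition 2: scaled-box IoU
        if score > best_score:                                 # condition 3: keep the best candidate
            best_id, best_score = track_id, score
    return best_id
```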
If the target is redetected during the “Target ID Change” or the “Target Missing” modes, the UAV will immediately resume regular operations and continue following the target. If the redetection happens during the “Target Lost” state, the system mode switcher will change the input to the controller accordingly, in order to resume target following operations after a complete stop. Further explanation is given in the system mode switcher section (Section 3.6).
3.4. Distance Estimation
Estimating distance is essential for effectively tracking a target. This estimation can be roughly determined using the detection bounding box from the “visual detection and tracking module” or more precisely obtained using an installed depth sensor. However, depth information is not always available, either due to the absence of a depth sensor or when the target is beyond the sensor’s detection range.
In order to estimate the distance ($d$ in Figure 4) from the bounding box, we use a relation between the pixel height ($h$) of the target in the image and a tuned constant value $C$, as shown in Equation (4):

$$d = \frac{C}{h} \quad (4)$$

The target’s height in the image is chosen as the reference instead of the bounding box area ($w \times h$) [10] because it mainly depends on the actual height of the target and the distance to the target. Conversely, the width in the image can vary based on movement direction and limb position. The constant value can be pre-tuned for an average human height and adjusted during operations using better estimates from onboard depth sensors. For non-human targets like vehicles or robots, the bounding box area can be used.
To reduce susceptibility to observation noise, an exponential low-pass filter in a discrete-time system is applied, as shown in Equation (5):

$$\hat{d}_k = \alpha\, d_k + (1 - \alpha)\, \hat{d}_{k-1} \quad (5)$$

where $\hat{d}_k$ is the filtered value, $\hat{d}_{k-1}$ is the previous estimation and $d_k$ is the current estimation. The $\alpha$ value is tuned to prevent higher-frequency oscillations in the measurements that could disrupt the controller.
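The distance estimation and filtering steps can be summarized in a few lines; the constant $C$, the depth-validity check and the filter coefficient $\alpha$ below are illustrative assumptions, not the tuned values used in the experiments.

```python
# Bounding-box distance estimate (Equation (4)) with a depth-sensor override and
# exponential low-pass filtering (Equation (5)); all constants are illustrative.
class DistanceEstimator:
    def __init__(self, C=700.0, alpha=0.3):
        self.C = C             # pre-tuned for an average human height
        self.alpha = alpha     # low-pass filter coefficient
        self.filtered = None

    def estimate(self, box_height_px, depth_m=None):
        # Prefer the depth sensor reading when it is available and valid
        if depth_m is not None and depth_m > 0.0:
            raw = depth_m
        else:
            raw = self.C / max(box_height_px, 1.0)   # d = C / h
        # Exponential filter: d_hat_k = alpha * d_k + (1 - alpha) * d_hat_{k-1}
        if self.filtered is None:
            self.filtered = raw
        else:
            self.filtered = self.alpha * raw + (1.0 - self.alpha) * self.filtered
        return self.filtered
```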
3.5. Following Flight Controller
The following flight controller consists of three separate controllers to achieve precise 3D tracking of the target: one for yaw rate, one for altitude (vertical velocity), and one for the horizontal velocity of the UAV.
This approach is inspired by [8,9], using the deviation between the target’s pixel location and the image frame’s centre as feedback to keep the UAV focused on the target. The yaw rate and altitude controllers are proportional controllers aimed at centring the target in the image. Normalized references for the horizontal and vertical pixel positions are used (Equation (6)); the principal point coordinates $c_x$ and $c_y$ are obtained from the intrinsic parameters of the camera. The heading reference for the controller and the vertical velocity are then computed using Equations (7) and (8), respectively, where $K_\psi$ and $K_z$ are the respective proportional gains.
The algorithm uses an aim-and-approach strategy similar to the PN strategy in [6] to achieve target following, by using the yaw rate and altitude controllers to aim and the horizontal velocity controller to approach the target. The control output for the velocity is calculated by using a PI controller taking the estimated distance as input. The horizontal velocity controller error is defined according to Equation (9), where the desired distance $d_{des}$ is defined by the user. To achieve smooth control, the velocity value $V$ is passed through a slew rate limiter, which prevents sudden and aggressive manoeuvres that could cause a loss of line of sight. The slew rate limiter also ensures the initial control output is zero, allowing for controlled initial movement. The limited rate of change is defined by Equation (10).
In addition to the direct distance between the UAV and the target, the horizontal distance is considered a limiter to prevent the UAV from overshooting the target. This distance is calculated based on the current altitude of the UAV, an estimate of the target’s height, and the assumption that the target is in the same plane as the measured altitude, as shown in Equation (11). This value can also be compromised by the quality of the altitude measurements, which is why it is only used as a limiter and not as the error feedback for the controller.
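A compact sketch of the three controllers and the slew rate limiter is given below; the gains, the normalization and sign conventions, the limiter form and the handling of the horizontal-distance cap are assumptions broadly consistent with Equations (6) to (11), not the tuned controller used on the UAV.

```python
# Sketch of the following flight controller (Equations (6)-(11)); all gains,
# conventions and limits are illustrative assumptions.
class FollowingController:
    def __init__(self, k_yaw=1.0, k_z=0.5, kp_v=0.8, ki_v=0.1,
                 d_des=11.0, dv_max=0.5, dt=0.1):
        self.k_yaw, self.k_z = k_yaw, k_z      # proportional gains (Eqs. (7), (8))
        self.kp_v, self.ki_v = kp_v, ki_v      # PI gains for the forward velocity
        self.d_des = d_des                     # desired distance set by the user
        self.dv_max, self.dt = dv_max, dt      # slew-rate limit (Eq. (10)) and control period
        self.int_e = 0.0
        self.v_prev = 0.0                      # limiter starts from zero output

    def compute(self, x_c, y_c, c_x, c_y, distance, v_cap=None):
        # Normalized image errors around the principal point (Eq. (6))
        e_x = (x_c - c_x) / c_x
        e_y = (y_c - c_y) / c_y
        yaw_rate = self.k_yaw * e_x            # aim: centre the target horizontally (Eq. (7))
        v_z = self.k_z * e_y                   # aim: centre the target vertically (Eq. (8))

        # Approach: PI controller on the distance error (Eq. (9))
        e_d = distance - self.d_des
        self.int_e += e_d * self.dt
        v_cmd = self.kp_v * e_d + self.ki_v * self.int_e

        # Optional cap derived from the horizontal distance (Eq. (11)) to avoid overshooting
        if v_cap is not None:
            v_cmd = min(v_cmd, v_cap)

        # Slew rate limiter (Eq. (10)) for smooth velocity changes
        dv = max(-self.dv_max, min(self.dv_max, v_cmd - self.v_prev))
        v = self.v_prev + dv
        self.v_prev = v
        return yaw_rate, v_z, v
```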
Figure 5 shows the workings of the high-level controller. It takes the information from the RGB-D camera to compute the forward velocity, the vertical velocity and the yaw rate, which are then sent to the FCU (the Pixhawk autopilot).
3.6. System Mode Switcher
The system mode switcher is responsible for providing the low-level controller with appropriate control references based on the mode of the system.
The system can operate in the following modes:
Search Mode: no target has been detected and identified yet. The UAV performs a predetermined search pattern until the target is found.
Adjusting Mode: only the yaw rate and vertical velocity controllers are used to centre the target in the image.
Following Mode: this may be defined as the mode for standard operations, when the target is identified and followed in 3D. This is also the chosen mode if a Target ID Change occurred.
Target Missing Mode: as previously defined in Section 3.3, the target is not visible in the image.
Target Lost Mode: also defined in Section 3.3, if the target has been missing for several consecutive frames, it is assumed lost/hidden. Adequate procedures are taken to prevent further deviations from a target hidden behind an obstacle.
Figure 6 presents a flowchart representing the system modes and their respective interactions.
Following take-off and systems check, the UAV will enter “Search Mode”. In this mode, the UAV will perform a predetermined search pattern until the target is found. In this work, the UAV will climb to the defined safety altitude and slowly rotate until a person is detected. The first person detected will be set as the target to follow. The system allows the search mode to be redefined according to the specifications of the mission (e.g., making it possible for the user to select the specific target to follow).
After the target is detected and identified, the UAV will enter “Adjusting Mode”, centring the target in the image as well as possible, within altitude safety limits, by controlling the altitude and yaw rate. This allows the system to have a more controlled first approach to the target. After the “Adjusting Mode” timer expires, the system enters regular operations with the “Following Mode”, which feeds the output from all the controllers to the autopilot.
Regarding the first recovery mode, “Target Missing Mode”, the system mode switcher will continue to send the commands from all the controllers; however, the estimated distance, and consequently the error, will be calculated using the predictions from the Kalman filter.
Finally, in the second recovery mode, “Target Lost Mode”, the system mode switcher will set the UAV to hover. This is performed to prevent further deviation from the lost target, here assumed to be hidden behind some obstacle. The system will remain in this mode until a successful redetection is obtained. Unlike the “Target Missing Mode”, which shifts directly back to “Following Mode”, here, once the target is redetected, the system mode switcher will initiate the “Adjusting Mode” for a brief period and reset the previous values of the rate limiter and the low-pass filter. This ensures the target is properly reacquired and followed.
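The mode logic described above can be captured by a small state machine; the mode names follow the text, while the frame thresholds and the interface are illustrative assumptions.

```python
# Minimal state machine for the system mode switcher; thresholds are illustrative.
from enum import Enum, auto

class Mode(Enum):
    SEARCH = auto()
    ADJUSTING = auto()
    FOLLOWING = auto()
    TARGET_MISSING = auto()
    TARGET_LOST = auto()

class ModeSwitcher:
    def __init__(self, adjust_frames=50, lost_frames=30):
        self.mode = Mode.SEARCH
        self.adjust_frames = adjust_frames   # duration of the Adjusting Mode timer
        self.lost_frames = lost_frames       # missing frames before Target Lost
        self.timer = 0

    def step(self, target_visible, target_acquired):
        if self.mode == Mode.SEARCH:
            if target_acquired:
                self.mode, self.timer = Mode.ADJUSTING, 0
        elif self.mode == Mode.ADJUSTING:
            self.timer += 1
            if self.timer >= self.adjust_frames:
                self.mode = Mode.FOLLOWING
        elif self.mode == Mode.FOLLOWING:
            if not target_visible:
                self.mode, self.timer = Mode.TARGET_MISSING, 0
        elif self.mode == Mode.TARGET_MISSING:
            if target_visible:
                self.mode = Mode.FOLLOWING           # resume directly after a short occlusion
            else:
                self.timer += 1
                if self.timer >= self.lost_frames:
                    self.mode = Mode.TARGET_LOST     # hover and wait for redetection
        elif self.mode == Mode.TARGET_LOST:
            if target_visible:
                self.mode, self.timer = Mode.ADJUSTING, 0  # re-centre before following again
        return self.mode
```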
5. Simulation Experiments
To assess system performance in open environments with multiple pedestrians, five tests were conducted in a 900-square-meter obstacle-free area, featuring three to six pedestrians walking at an average speed of 1 to 1.5 m/s. Each test spans approximately 2 min, with randomized rather than pre-set trajectories. The desired distance ($d_{des}$) is set to 11 m. In order to perform a comparison between MOT methods, the same world conditions (target movements) are replicated in the tests conducted using BoT-SORT and ByteTrack.
Results are summarized for the BoT-SORT tests in Table 1 and for ByteTrack in Table 2. Across all the experiments, the UAV successfully tracked and followed the target, covering a total travel distance of 883.5 m, while the target travelled 1264.6 m. “Depth Use” shows that the depth estimation is used over 50% of the time in all the tests, which is expected given the desired distance value of 11 m. Taking into consideration the dynamic use of both estimation methods for distance estimation, the error values are within a range that allows for accurate target following in dynamic scenarios. The standard deviation of the estimation values is relatively high compared to the average error, mainly due to the differences in accuracy between the two estimation methods (using the bounding box height or the depth information). Regarding computational load, the BoT-SORT algorithm averages around 12.07 Hz ± 1.57 Hz, which is adequate for real-time applications. In addition, the BoT-SORT algorithm proves to be very effective in target following, maintaining the target ID in all but the fourth experiment, where the “Target ID Change” redetection method had to be used once. In the ByteTrack experiments, there is a substantial tracking performance drop from the BoT-SORT results, which derives from the lack of camera motion compensation enhancements in the ByteTrack algorithm. Despite the higher number of Target ID Changes, the redetection algorithm developed is able to handle these limitations and continue to follow the target under these conditions. Also, the efficiency of the ByteTrack algorithm in comparison with BoT-SORT becomes evident, with an average of 17.56 Hz ± 0.96 Hz, which translates to a 31.25% decrease in runtime compared with the BoT-SORT results.
To evaluate the capability of the system to maintain line-of-sight and keep the target in the centre of the image, the heatmap of the target’s bounding box centre position in the image was compiled across the five tests for each tracker and is shown in Figure 10. Results are similar for both trackers, with the target remaining predominantly centred. Target deviations along the vertical axis can be attributed to the lack of gimbal stabilization for the camera, which couples forward/backward movement with a downward/upward pitch manoeuvre. It is believed that introducing camera stabilization would greatly improve these results by decoupling both movements.
To further evaluate the full system, an experiment was conducted that includes partial and full occlusions of up to 10 s, during which the system goes through the various redetection modes: Target ID Change, Target Missing and Target Lost. Similarly to the previous experiments, the same scenario was also run for both trackers in order to compare results, which are presented in Table 3.
Overall, the results align closely with those of the obstacle-free experiments, with the only notable difference being a decrease in “Visual Accuracy”. This decrease can be attributed to instances where the target experiences full occlusions during the course of the experiment. The differences between the YOLOv8+BoT-SORT and YOLOv8+ByteTrack setups are even more pronounced in the long-term experiments, with the BoT-SORT tracker performing much better at keeping the IDs of the targets, while the ByteTrack tracker has the better frame rate. Considering the several instances of partial and full occlusions and the number of ID changes, especially in the ByteTrack experiments, it can be said that the redetection methods performed well during both experiments, maintaining target following throughout. Nevertheless, the robustness of the BoT-SORT algorithm makes it more suitable for target following applications where dynamic scenarios can provoke unforeseen circumstances. The distance to the target is correctly estimated, with an average error of 0.599 m ± 0.502 m for the BoT-SORT and 0.587 m ± 0.640 m for the ByteTrack experiments, with some higher errors when the target changes direction. The estimated distance versus the real distance and the respective errors are shown for the BoT-SORT experiments in Figure 11. A demonstration video showcasing the entire experiment using YOLOv8+BoT-SORT is available at the following address (https://youtu.be/YrquRNc5tKM, accessed on 1 September 2024).