
drones

Article
Visual Object Tracking Based on the Motion Prediction and
Block Search in UAV Videos
Lifan Sun 1,2,3, *, Xinxiang Li 1 , Zhe Yang 4 and Dan Gao 1

1 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China;
lxx@stu.haust.edu.cn (X.L.); d.gao@haust.edu.cn (D.G.)
2 Longmen Laboratory, Luoyang 471000, China
3 Henan Academy of Sciences, Zhengzhou 450046, China
4 Xiaomi Technology Co., Ltd., Beijing 100102, China; yangzhe11@xiaomi.com
* Correspondence: lifan.sun@haust.edu.cn

Abstract: With the development of computer vision and Unmanned Aerial Vehicle (UAV) technology, visual object tracking has become an indispensable core technology for UAVs, and it has been widely used in both civil and military fields. Visual object tracking from the UAV perspective experiences interference from various complex conditions such as background clutter, occlusion, and being out of view, which can easily lead to tracking drift. Once tracking drift occurs, it will lead to almost complete failure of the subsequent tracking. Currently, few trackers have been designed to solve the tracking drift problem. Thus, this paper proposes a tracking algorithm based on motion prediction and block search to address the tracking drift problem caused by various complex conditions. Specifically, when the tracker experiences tracking drift, we first use a Kalman filter to predict the motion state of the target, and then use a block search module to relocate the target. In addition, to improve the tracker's ability to adapt to changes in the target's appearance and the environment, we propose a dynamic template updating network (DTUN) that allows the tracker to make appropriate template decisions based on various tracking conditions. We also introduce three tracking evaluation metrics, namely average peak correlation energy, size change ratio, and tracking score. They serve as prior information for tracking status identification in the DTUN and the block-prediction module. Extensive experiments and comparisons with many competitive algorithms on five aerial benchmarks, UAV20L, UAV123, UAVDT, DTB70, and VisDrone2018-SOT, demonstrate that our method achieves significant performance improvements. Especially in UAV20L long-term tracking, our method outperforms the baseline in terms of success rate and accuracy by 19.1% and 20.8%, respectively. This demonstrates the superior performance of our method in the task of long-term tracking from the UAV perspective, and we achieve a real-time speed of 43 FPS.

Keywords: object tracking; block search; motion prediction; dynamic templates; evaluation metrics

Citation: Sun, L.; Li, X.; Yang, Z.; Gao, D. Visual Object Tracking Based on the Motion Prediction and Block Search in UAV Videos. Drones 2024, 8, 252. https://doi.org/10.3390/drones8060252

Academic Editors: Oleg Yakimenko and Anastasios Dimou

Received: 21 April 2024; Revised: 28 May 2024; Accepted: 5 June 2024; Published: 7 June 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
In recent years, UAV technology and industry have developed rapidly. Object tracking technology from the UAV perspective has been widely used and has become the core technology of UAVs [1–3]. Visual object tracking technology is not only widely used in civil fields such as security, logistics, and rescue but also plays an important role in military fields such as intelligence collection, guidance, and remote sensing [4]. However, object tracking technology from the UAV perspective faces several challenges and problems [5]. (1) Tracking robustness under complex scenarios: there are many challenging statuses in object tracking tasks from an aerial perspective, such as fast motion, object deformation, background clutter, occlusion, and being out of view. (2) Resource constraints: UAVs have limited computational resources and energy, and must achieve efficient object tracking and meet real-time requirements with limited resources. Therefore, researchers still need to explore and solve the problems faced by object tracking technology from the UAV perspective.
Traditional object-tracking algorithms usually use filtering methods (e.g., Kalman
filter [6]) to estimate the motion state of the object, and predict the position of the object
by establishing state equations and observation equations. However, these algorithms
are prone to the problem of tracking drift when the object undergoes sudden changes in
motion or when the sensor wobbles. In recent years, with the development of deep learning
(DL) technology, researchers have begun to apply DL techniques to visual object tracking
and have made significant progress. Due to their excellent performance, Siamese network
trackers [7,8] have been favored by many researchers. In visual object tracking, Siamese
networks are deep neural networks for learning target features and computing object similarities. A Siamese network consists of two weight-sharing convolutional neural networks (CNNs) that process the
template image and the search image respectively and output their feature representations.
Some derived methods such as STMTrack [9], SiamGAT [10], and SiamRN [11] are based on
Siamese networks. These methods usually improve the training method of the algorithm
and the similarity calculation process but do not address the tracking drift problem caused
by disturbances such as object occlusion and being out of view. As a result, these methods
are not advantageous for long-term tracking tasks in complex scenes.
With the improvement of computing resources and the success of the Transformer
architecture in the field of natural language processing, target tracking algorithms based
on Transformer architecture have become a hot research topic, such as DropTrack [12],
SwinTrack [13], and AiATrack [14]. In visual object tracking, tracking algorithms based
on the Transformer architecture mainly consist of an encoder and a decoder. The encoder
and decoder each consist of multiple identical layers, each containing a self-attention
mechanism and a feedforward neural network. The self-attention mechanism can model
associations between different locations in the input stream. Feedforward neural networks
introduce nonlinear transformations into the feature representation. SwinTrack [13] uses
the Transformer architecture for both representation learning and feature fusion. The
Transformer architecture allows for better feature interaction for tracking than the pure
CNN framework. AiATrack [14] proposes an attention-in-attention (AiA) module, which
enhances appropriate correlations by seeking consensus among all correlation vectors.
The AiA module can be applied to both self-attention blocks and cross-attention blocks to
facilitate feature aggregation. It is due to the attentional mechanism of the Transformer
architecture and the feed-forward neural network that the tracker performs the tracking
task with better robustness and generalization. However, for the same reason, trackers
based on the Transformer architecture are more computationally intensive. Therefore,
certain limitations exist in performing tracking tasks on UAV platforms.
In order to solve the tracking drift problem caused by complex conditions such as
occlusion and being out of view, this paper proposes a visual object tracking algorithm
based on motion prediction and block search and improves it for the tracking task from
the UAV perspective. We introduce three metrics for evaluating tracking results: average
peak correlation energy (APCE), size change ratio (SCR), and tracking score (TS). These
metrics aim to jointly identify the tracking status and provide prior information for the
proposed dynamic template updating network (DTUN). The proposed DTUN employs
the optimal template strategy based on different tracking statuses to improve the tracker’s
robustness. We utilize Kalman filtering for motion prediction when the object is temporarily
occluded or subject to other disturbances that cause tracking failure.
In cases of near-linear motion, the Kalman filter can be applied simply and efficiently. In
cases of nonlinear object motion, the Kalman filter predicts the approximate
motion direction of the object, which is crucial for the block-search module. The proposed
block-prediction module mainly solves the long-term tracking drift problem. Considering
cases of tiny objects from the UAV perspective, we process the enlarged search region
with block search, significantly improving search efficiency for tiny objects. Overall, our
method performs well in solving the tracking drift problem in long-term tracking caused
by complex statuses such as object occlusion and being out of view. Figure 1 shows a
visualization of our method compared with the baseline algorithm. When tracking drift
occurs due to object occlusion, it will continue if no action is taken (as shown by the blue
bounding box). In contrast, our method employs a target prediction and block search
framework, which can effectively relocate the tracker to the object (as shown by the red
bounding box).

(a) (b) (c) (d)

Figure 1. Visualization results of the proposed MPBTrack method compared with the baseline on
the UAV123 dataset. The first row shows the bounding box of the object, where green, red, and blue
colors indicate the ground truth, our method and the baseline method, respectively. The second and
third rows display the response map results for our method and the baseline method, respectively.
(a) represents the tracking drift condition caused by the object being out of view. (b,c) represent the
tracking drift conditions caused by the occlusion. (d) represents the tracking error caused by the
occlusion and the interference from a similar object.

Extensive experiments on five aerial datasets, UAV20L [15], UAV123 [15], UAVDT [16],
DTB70 [17], and VisDrone2018-SOT [18], demonstrate the superior performance of our
method. In particular, on the large-scale long-term tracking dataset UAV20L, our method
achieves a significant improvement of 19.1% and 20.8% in terms of success and accuracy,
respectively, compared to the baseline method. In addition, our method achieves a tracking
speed of 43 FPS, which far exceeds the real-time requirement. This demonstrates the high-
efficiency and high-accuracy performance of our method in performing the object tracking
task from the UAV perspective. Figure 2 shows the results of our method compared with
other methods in terms of success rate and tracking speed.
In summary, the main contributions of this paper can be summarized in the following
three aspects:
(1) We introduce three evaluation metrics: the APCE, SCR, and TS, which are used to
evaluate the tracking results for each frame. These evaluation metrics are used to
jointly identify the tracking status and provide feedback information to the DTUN in
the tracking of subsequent frames. The proposed DTUN adjusts the template strategy
according to different tracking statuses, enabling the tracker to adapt to changes in the object and the tracking scenario.
(2) We propose a motion prediction and block search module. When tracking drift occurs
due to complex statuses (e.g., occlusion and being out of view), we first predict the
motion state of the object using a Kalman filter, and then utilize the block search to
re-locate the object. Its performance is excellent for solving the tracking drift problem
from the UAV perspective.
(3) Our proposed algorithm achieves significant performance improvement on five aerial
datasets, UAV20L [15], UAV123 [15], UAVDT [16], DTB70 [17], and VisDrone2018-
SOT [18]. In particular, on the long-term tracking dataset UAV20L, our method
achieves 19.1% and 20.8% increase in success and precision, respectively, compared to
the baseline method, and achieves 43 FPS real-time speed.

[Figure 2 plot: success (AUC) versus tracking speed (FPS) on UAV20L for MPBTrack (ours) and competing trackers, including DropTrack, ROMTrack, OSTrack-256, RTS, ARTrack, SLT-TransT, ToMP-101, PrDiMP-50, TaMOs, STMTrack, TransT50, SiamBAN, SiamCAR, CNNInMo, and SLT-SiamAttn.]

Figure 2. Performance comparison of our method with others on the UAV20L long-term tracking
dataset in terms of success and speed.

This paper is organized as follows: Section 2 describes other work related to our
approach. Section 3 describes the framework and specific details of our proposed approach.
Section 4 shows the results of experiments on five aerial datasets and compares them with
other methods. Finally, our work is summarized in Section 5.

2. Related Work
Visual object tracking is an important task in computer vision that has developed
rapidly in recent years. It aims to accurately track objects in video sequences in real time. In
addition, the application of visual object tracking in the field of UAVs is receiving increasing
attention and plays an important role. In this chapter, the research on visual object tracking
algorithms and their development on UAVs is discussed.

2.1. Object Tracking Algorithm


With the development of computer vision and DL, researchers began to apply DL to
visual object tracking. The introduction of the Siamese network has made a significant
contribution to the progress of visual object tracking. It learns the representation of the
template and search images and calculates the similarity between them. The template and
search images are fed into two networks with shared weights. Siamese network-based
trackers have achieved high accuracy and real-time performance in object-tracking tasks.
Recently, the Transformer model achieved great success in natural language processing and
computer vision with the improvement of computational resources. Transformer-based
tracking algorithms achieve more accurate tracking by introducing an attention mechanism
to model the relationship between the object and its surrounding context. These methods
take a sequence of object features as input and use a multilayer Transformer encoder to
learn the object’s representation.
In recent years, Siamese network-based tracking algorithms have been extensively
researched. Li et al. proposed SiamRPN [19] by introducing the regional proposal network
(RPN) [20] into the Siamese network. RPN consists of two branches, object classification
and bounding box regression. The introduction of RPN eliminates the traditional multiscale
testing and online fine-tuning process, and greatly improves the accuracy and speed of
object tracking. Guo et al. argued that anchor-based regression networks require tricky
hyperparameters to be set manually, so they proposed an anchor-free tracking algorithm
(SiamCAR) [21]. Compared to the anchor-based approach, the anchor-free regression
network has fewer hyperparameters. It directly calculates the distance from the center
of the object to the four edges during the bounding box regression process. Fu et al.
argued that the fixed initial template information has been fully mined and the existing
online learning template update process is time-consuming. Therefore, they proposed a
tracking framework based on space-time memory networks (STMTrack) [9] that utilizes
the historical templates from the tracking process to better adapt to changes in the object's
appearance. RTS [22] proposed a segmentation-centered tracking framework that can
better distinguish between object and background information. It can generate accurate
segmentation masks in the tracking results, but there is a reduction in the tracking speed.
Although the above trackers improve tracking accuracy by improving the model training
method and bounding box regression, they do not effectively solve the tracking drift
problem due to complex situations. In addition, Zheng et al. [23] argued that the imbalance
of the training data makes the learned features lack significant discriminative properties.
Therefore, in the offline training phase, they made the model more focused on semantic
interference by controlling the sampling strategy. In the inference phase, an interference-
aware module and a global search strategy are used to improve the tracker’s resistance to
interference. However, this global search strategy is not good for tracking tiny objects from
the UAV perspective, especially when there are similar objects around, background clutter,
or low resolution.
After Siamese networks, Transformer large-model-based trackers also achieved ex-
cellent results. Yan et al. [24] presented a tracking architecture with an encoder–decoder
transformer (STARK). The encoder models the global spatio-temporal feature information
of the target and the search region, and the decoder predicts the spatial location of the
target. The encoder–decoder transformer captures remote dependencies in both spatial and
temporal dimensions and does not require subsequent hyperparameter processing. Cao
et al. proposed an efficient Hierarchical Feature Transformer (HiFT) [25], which inputs hier-
archical similarity maps into the feature transformer for the interactive fusion of spatial and
semantic cues. It not only improves the global contextual information but also efficiently
learns the dependencies between multilevel features. Ye et al. [26] argued that the features
extracted by existing two-stage tracking frameworks lack target perceptibility and have
limited target–background discriminability. Therefore, they proposed a one-stage tracking
framework (OSTrack), which bridges templates and search images with bidirectional in-
formation flows to unify feature learning and relation modeling. To further improve the
inference efficiency, they proposed an in-network candidate early elimination module to
gradually discard candidates belonging to the background. The above Transformer model-
based tracker achieves significant performance improvement in visual object tracking, but
it requires high computational resources, which are not advantageous for applications on
UAV platforms.

2.2. Tracking Algorithms in the UAV


Visual object-tracking technology is widely used as the core technology of UAV appli-
cations. The fast and automatic tracking of moving targets can be achieved by equipping
UAVs with visual sensors and object-tracking algorithms. However, tiny objects detected by
aerial sensors are easily occluded and susceptible to environmental influences. In addition,
there are limitations in the computational power of UAV devices. Therefore, researchers
have focused on developing, researching, and improving object-tracking algorithms specif-
ically for aerial tracking to address these challenges.
Cao et al. [25] argued that using only the last layer of image features will reduce
the accuracy of aerial tracking, and simply using multiple layers of features will increase
the online inference time. Therefore, they proposed the hierarchical feature Transformer
framework. This framework enables the interactive fusion of spatial and semantic in-
formation to achieve efficient aerial tracking. TCTrack [27] proposed a temporal context
information tracking framework that can fully utilize the temporal context of aerial tracking.
Specifically, they proposed online temporal adaptive convolution to enhance temporal
information of spatial features and an adaptive temporal converter that uses temporal
knowledge for encoding-decoding. The proposed online temporal adaptive convolution
can be used to enhance the temporal information of spatial features. Since visual trackers
may perform the tracking task at night, low-light conditions are unfavorable for tracking.
Therefore, Li et al. [28] proposed a novel discriminative correlation filter-based tracker
(ADTrack) with illumination adaptive and anti-dark capability. This tracker first extracts
the illumination information from the image, then performs enhancement preprocessing,
and finally generates an object-aware mask to realize object tracking. For airborne tracking,
if the object is within a complex environment, tracking performance is severely compro-
mised. With the application of depth cameras on UAVs, adding depth information can
more effectively deal with complex scenes such as background information interference.
Therefore, Yang et al. [29] proposed a multimodal fusion and feature-matching algorithm
and constructed a large-scale RGBD (RGB-Depth map) tracking dataset. In addition, RGB-T
(RGB-thermal) [30] tracking is also an effective means in the field of visual object tracking
to solve the difficulty of UAV tracking in complex scenes. However, adding additional
sensor devices also significantly increases the cost and flight burden of UAVs.

3. Proposed Method
This section provides a comprehensive overview of the proposed method. Section 3.1
outlines the overall structure, Section 3.2 delves into the dynamic template updating net-
work, Section 3.3 discusses the specifics of the search–evaluation network, and Section 3.4
presents details related to the block-prediction module.

3.1. Overall Framework


Figure 3 illustrates the structure of the proposed MPBTrack algorithm, comprising
three parts: the dynamic template updating network, the search–evaluation network, and
the block-prediction module. The dynamic template updating network adjusts the number
of templates according to the evaluation results regarding the tracking condition. The
tracking results for each frame are filtered, and high-quality templates are stored in the
template memory. The template extraction module extracts a corresponding number of
diverse high-quality template features from the template memory and then concatenates
these template features with the initial templates. The number of templates used for
the current frame tracking is determined by the received APCE feedback results. The
search–evaluation network calculates the similarity between the template image and the
search image and evaluates the tracking results. Following the similarity calculation, the
network obtains results for the object’s classification, center-ness, and regression from three
branches. The tracking status evaluation network yields three metrics: APCE , TS, and
SCR. The block-prediction module first recognizes the tracking status based on the three
joint evaluation metrics. When the tracker detects tracking drift caused by occlusion or
other interference, it will predict the target’s motion trajectory using a Kalman filter. If the
object’s position is not accurately predicted within 20 frames, a block search in an expanded
area is utilized to relocate the object.

[Figure 3 diagram: the dynamic template updating network (feature memory, template extraction mechanism, initial and previous template features), the search–evaluation network (classification, center-ness, and regression branches yielding the APCE score, size change ratio, and tracking score), and the block-prediction module (status recognizer, motion prediction, and expanded block search).]

Figure 3. Overall network framework of MPBTrack. “⋆” indicates cross correlation operation.

3.2. Dynamic Template Updating Network


Since fixed templates alone cannot adapt to changes in the object's appearance and
tracking scenarios (e.g., illumination variation and background clutter), they can easily lead
to tracking failures. Therefore, we propose a dynamic template updating network to adapt
to changes in the object's appearance and tracking scenarios during the tracking process.
Under normal tracking conditions, the tracker can achieve accurate tracking by using only a
few templates. Under complex tracking conditions, we employ more historical templates
to improve the robustness of object tracking. The method dynamically switches the number
of templates according to the tracking status, which improves tracking robustness as well
as computational efficiency.
Figure 4 shows the structure of the DTUN. The network includes a dynamic template
in addition to the traditional initial template. The dynamic templates are obtained from
high-quality regions filtered by the tracking results. The dynamic features extracted by the
feature extraction network are stored in the feature memory. In poor tracking status (e.g.,
background clutter, motion blur, and object occlusion), the target region of the tracking
result is affected when the template quality is low. If low-quality templates are stored
in the template memory, the subsequent tracking accuracy will be greatly affected. The
tracking score is the most intuitive response to the quality of the tracking region. In order
to improve the quality of templates in the template memory, we filter the tracking results
to ensure that high-quality templates are stored in the feature memory while low-quality
templates are discarded.

[Figure 4 diagram: the initial and dynamic template features are concatenated with the search feature and fed to the network; the APCE of the response map is fed back as quality feedback, and tracked regions are stored in the feature memory if their tracking score is high or abandoned if low.]

Figure 4. Dynamic template updating network structure. “+” indicates concatenation operation. The
red box in the response map represents the maximum tracking score.

The TS metric is employed to assess the quality of the tracked region for each frame. If
the TS of the current region exceeds a predefined threshold (0.6), it will be incorporated
into the feature memory as a new template, following the processes of cropping and
feature extraction. Specifically, the new template will perform the following operations
to yield the optimal template features: (1) obtain the cropping size of the current image
according to Equation (23), and then crop with the target as the center point; (2) resize the
cropped region to 289 × 289 to obtain a template containing the foreground region and the
background region; (3) generate a foreground–background mask map with the same size
as the foreground and the background in the template; and (4) the cropped template and
the mask map are jointly fed into the feature extraction network to obtain the template
features (the purpose is to improve the tracker’s discrimination between the target and the
background), and finally the template features are stored in the feature memory.
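For illustration, the following Python sketch outlines this template-filtering step under simplified assumptions; the names `feature_extractor` and `memory`, as well as the fixed-margin cropping and mask construction, are placeholders rather than the paper's released implementation.

```python
import numpy as np
import cv2  # assumed available for cropping/resizing


def maybe_store_template(frame, box, ts, feature_extractor, memory,
                         ts_threshold: float = 0.6, size: int = 289):
    """Sketch of the template-filtering step: keep only high-TS regions.

    `frame` is an H x W x 3 image, `box` = (cx, cy, w, h) is the tracked
    region, `ts` is the tracking score of the current frame, and
    `feature_extractor` / `memory` are placeholders for the backbone and
    the feature memory described in the paper.
    """
    if ts <= ts_threshold:
        return  # low-quality result: discard, do not pollute the memory

    cx, cy, w, h = box
    # Crop a square region centred on the target (the crop size of
    # Equation (23) is simplified here to a fixed margin around the box).
    half = int(max(w, h))
    x0, y0 = max(0, int(cx - half)), max(0, int(cy - half))
    crop = frame[y0:y0 + 2 * half, x0:x0 + 2 * half]
    crop = cv2.resize(crop, (size, size))

    # Foreground/background mask with the same spatial size as the crop.
    mask = np.zeros((size, size), dtype=np.float32)
    mask[size // 4: 3 * size // 4, size // 4: 3 * size // 4] = 1.0

    memory.append(feature_extractor(crop, mask))
```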
The APCE score measures the quality of the response map of the tracking results
(described in more detail in Section 3.3.2). The tracker receives feedback on the APCE
results at each frame and transforms the APCE values into the corresponding number of
templates. The equation for this transformation process is as follows:

$T = N_{max} - \frac{N_{max} - 1}{1 + \exp(-a(APCE - b))}$  (1)

where T denotes the number of templates, and Nmax denotes the maximum value of the
template range. We set the template range to 1–10, and a and b denote the slope and
horizontal offset of the function, respectively. If the tracking quality of the previous frame is
high, a small number of templates will be used in the next frame to achieve better tracking
results. When the tracker receives a lower tracking quality in the previous frame, more
historical templates will be utilized to adapt to the current complex tracking situation.
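As a concrete illustration of Equation (1), the short Python sketch below maps an APCE value to a template count; the 1–10 template range follows the text, while the slope a and offset b are illustrative values not reported in the paper.

```python
import math


def num_templates(apce: float, n_max: int = 10, a: float = 0.5, b: float = 20.0) -> int:
    """Map an APCE score to a template count via Equation (1).

    High APCE (good tracking quality) yields few templates; low APCE yields
    up to n_max templates. The slope `a` and offset `b` are illustrative.
    """
    t = n_max - (n_max - 1) / (1.0 + math.exp(-a * (apce - b)))
    return max(1, round(t))


# A confident previous frame needs fewer templates than an uncertain one.
print(num_templates(apce=45.0))  # high quality -> close to 1 template
print(num_templates(apce=8.0))   # low quality  -> close to 10 templates
```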
The template extraction mechanism is an important part of the dynamic template-
updating process. A large number of historical templates are stored in the template memory,
and effective utilization of these templates is crucial to the robustness and accuracy of
tracking. By using Equation (1), we can calculate the number of templates that need to
be used in the next frame, and then use the template extraction mechanism to select high-
quality and diverse templates from the historical ones, suitable for tracking in the next
frame. The template extraction can be denoted as:

$\tau_i = \left\lfloor \frac{t}{N} \right\rfloor \times i, \quad i = 0, 1, 2, \ldots, N$  (2)

$T_j = \mathrm{maxarg}\left[\tau_j, \tau_{j+1}\right], \quad j = 0, 1, 2, \ldots, N - 1$  (3)

$T_{con} = \mathrm{concat}(T_0, \ldots, T_{N-1})$  (4)

where t is the total number of templates in the library, and N is the number of tracked templates. $\tau_j$ denotes the j-th segmentation point, $\mathrm{maxarg}[\varphi_1, \varphi_2]$ denotes the maximum value in the interval $[\varphi_1, \varphi_2]$, and concat(·) denotes the concatenation operation. Assuming
that N templates are needed for the next frame, the specific extraction steps are as follows:
(1) The initial template is necessary, as it contains primary information about the tracking
target. Note that we exclude the last frame template, as it may introduce additional inter-
fering information. (2) The templates in the template memory are sequentially divided into
N − 1 segments, and then the template with the highest tracking score is selected from each
segment to serve as the optimal tracking template. This extraction mechanism can enhance
the diversity of templates and extract high-quality templates, thus significantly improving
the robustness of the tracker.
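A minimal Python sketch of this extraction mechanism (Equations (2)–(4)) is given below; it operates on the stored tracking scores only, and the exact segmentation bookkeeping is a simplification of the paper's procedure.

```python
def extract_templates(memory_scores, n: int):
    """Pick diverse, high-quality template indices per Equations (2)-(4).

    `memory_scores` lists the tracking scores of the templates stored in the
    feature memory (oldest first); `n` is the number of templates wanted for
    the next frame. The initial template is handled outside this sketch.
    """
    t = len(memory_scores)
    if n <= 1 or t == 0:
        return []
    # Segmentation points tau_i = floor(t / N) * i  (Equation (2)).
    seg = [(t // (n - 1)) * i for i in range(n)]
    seg[-1] = t  # ensure the last segment reaches the end of the memory
    chosen = []
    for j in range(n - 1):
        lo, hi = seg[j], seg[j + 1]
        if lo >= hi:
            continue
        # Highest-scoring template inside [tau_j, tau_{j+1})  (Equation (3)).
        chosen.append(max(range(lo, hi), key=lambda k: memory_scores[k]))
    return chosen  # features at these indices are concatenated (Equation (4))


# Example: 8 stored templates, 4 requested -> one index picked per segment.
print(extract_templates([0.7, 0.9, 0.65, 0.8, 0.95, 0.6, 0.85, 0.9], n=4))
```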

3.3. Search–Evaluation Network


3.3.1. Search Subnetwork
The search network utilizes STMTrack [9] as the baseline and GoogleNet [31] as the
feature extraction network. As shown in Figure 5, the template image z and search image
x undergo feature extraction to obtain their respective features φ(z) and φ( x ). In the
tracking process, the extracted historical template features are concatenated, and then
the concatenated template features and the search image features are subjected to a cross-
correlation operation to obtain the response map R∗ . The process can be represented as:

$R^* = \mathrm{concat}(\varphi_1(z), \ldots, \varphi_i(z)) \star \varphi(x)$  (5)

where φ denotes the feature extraction operation, ⋆ denotes the cross correlation operation,
and concat(, ) denotes the concatenation operation. Following the response map, the
classification convolutional neural network and regression convolutional neural network
are used to obtain the classification feature maps Rcls and regression feature maps Rreg ,
respectively. The purpose of the classification branch is to classify the target to be tracked
from the background. The classification branch includes a center-ness branch that boosts
confidence for positions closer to the center of the image. Multiplying the classification
response map scls with the center-ness response map sctr suppresses the classification
confidence for locations farther from the center of the target, resulting in the final tracking
response map. The purpose of the regression branch is to determine the distance from the
center of the target to the left, top, right, and bottom edges of the target bounding box in
the search image.
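The sketch below gives a toy PyTorch version of Equation (5) and of the classification/center-ness fusion; the learned convolutional heads of the real network are replaced by plain cross-correlation and averaging, so it only illustrates the data flow, not the actual model.

```python
import torch
import torch.nn.functional as F


def response_map(template_feats, search_feat):
    """Toy stand-in for Equation (5): correlate templates with the search feature.

    template_feats: list of (C, Hz, Wz) tensors; search_feat: (C, Hx, Wx).
    The real tracker fuses the concatenated memory features through learned
    heads; here each template is simply cross-correlated with the search
    feature and the responses are averaged, purely for illustration.
    """
    x = search_feat.unsqueeze(0)                                  # (1, C, Hx, Wx)
    maps = [F.conv2d(x, z.unsqueeze(0)) for z in template_feats]  # "star" operation
    return torch.stack(maps).mean(dim=0).squeeze(0)               # averaged response R*


def fuse(s_cls, s_ctr):
    """Final tracking response: classification confidence times center-ness."""
    return s_cls * s_ctr


feats = [torch.randn(64, 5, 5) for _ in range(3)]
print(response_map(feats, torch.randn(64, 25, 25)).shape)  # torch.Size([1, 21, 21])
```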
[Figure 5 diagram: the template and search images pass through the shared backbone; the concatenated template features are cross-correlated with the search features to produce the response map R*, which feeds the classification, center-ness, and regression (LTRB) heads.]

Figure 5. Search network structure. “⋆” denotes the cross correlation operation. R∗ represents the
response map.

During the training phase of the network model, the classification branch is trained
using the focal loss function, which can be expressed as:

$L_{cls} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$  (6)

where $(1 - p_t)^{\gamma}$ is a modulating factor, γ is a tunable focusing parameter, and $p_t$ is the predicted probability of a positive or negative sample. The regression branch utilizes intersection over union (IoU) as a loss function, which can be expressed as:

$L_{reg} = 1 - \frac{\mathrm{Intersection}(B, B^*)}{\mathrm{Union}(B, B^*)}$  (7)

where B is the predicted bounding box and B∗ is its corresponding ground-truth bounding
box. The center-ness branch uses a binary cross-entropy loss (BCE), which can be written as:

$L_{cen} = -\left[c_{(i,j)} \log p_{(i,j)} + (1 - c_{(i,j)}) \log\left(1 - p_{(i,j)}\right)\right]$  (8)

where p(i,j) represents the center-ness score at point (i, j). c(i,j) denotes the label value at
position (i, j). The final loss function is:

$Loss = \frac{1}{N}\sum_{x,y} L_{cls} + \frac{\lambda}{N}\sum_{x,y} L_{reg} + \frac{\lambda}{N}\sum_{x,y} L_{cen}$  (9)

where N denotes the number of points, and λ is the weight value.
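The loss terms of Equations (6)–(9) can be sketched as follows; these are generic re-implementations of focal, IoU, and BCE losses under assumed LTRB-encoded regression targets, not the authors' training code.

```python
import torch
import torch.nn.functional as F


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Equation (6): focal loss on the classification map (p, target in [0, 1])."""
    pt = torch.where(target > 0.5, p, 1.0 - p)
    at = torch.where(target > 0.5, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (-at * (1.0 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).mean()


def iou_loss(pred_ltrb, gt_ltrb):
    """Equation (7): 1 - IoU for LTRB-encoded boxes (distances to the four edges)."""
    pl, pt_, pr, pb = pred_ltrb.unbind(-1)
    gl, gt_, gr, gb = gt_ltrb.unbind(-1)
    inter = (torch.min(pl, gl) + torch.min(pr, gr)) * (torch.min(pt_, gt_) + torch.min(pb, gb))
    union = (pl + pr) * (pt_ + pb) + (gl + gr) * (gt_ + gb) - inter
    return (1.0 - inter / union.clamp(min=1e-6)).mean()


def total_loss(p_cls, t_cls, pred_ltrb, gt_ltrb, p_ctr, t_ctr, lam=1.0):
    """Equation (9): weighted sum of classification, regression and center-ness terms."""
    return (focal_loss(p_cls, t_cls)
            + lam * iou_loss(pred_ltrb, gt_ltrb)
            + lam * F.binary_cross_entropy(p_ctr, t_ctr))
```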

3.3.2. Evaluation Network


In the evaluation network, we introduce three evaluation metrics: APCE, SCR, and
TS. These three metrics will be jointly used for status identification in the block-prediction
module (detailed in Section 3.4). In addition, APCE will also be fed back to the template
extraction mechanism as prior information about the tracking quality. The tracking score
will also be used to filter high-quality templates and place them in the template memory.
Average Peak Correlation Energy: Inspired by the LMCF [32] method, we introduce
the APCE metric to evaluate the quality of the target region in the tracking response map.
When the target is not disturbed by poor conditions, the APCE value is high, and the 3D
heatmap of the tracking response map shows a stiff single peak shape. If the target is
affected by a cluttered background or occlusion, the APCE value will be lower, and the 3D
heatmap of the tracking response map will show a multipeak shape. The more the target is
affected, the lower the APCE value. The APCE is calculated as follows:

$APCE = \frac{|F_{max} - F_{min}|^2}{\mathrm{mean}\left(\sum_{w,h}\left(F_{w,h} - F_{min}\right)^2\right)}$  (10)

where $F_{max}$ and $F_{min}$ represent the maximum and minimum values of the response map of the tracking result, respectively, and $F_{w,h}$ is the response value in the response map at position (w, h).
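A direct NumPy implementation of Equation (10) is straightforward and illustrates why a sharp single-peak response yields a much larger APCE than a noisy multi-peak one:

```python
import numpy as np


def apce(response: np.ndarray) -> float:
    """Equation (10): average peak-to-correlation energy of a response map."""
    f_max, f_min = float(response.max()), float(response.min())
    return (abs(f_max - f_min) ** 2) / (np.mean((response - f_min) ** 2) + 1e-12)


# A sharp single peak gives a much higher APCE than a noisy multi-peak map.
sharp = np.zeros((25, 25)); sharp[12, 12] = 1.0
noisy = np.random.rand(25, 25)
print(apce(sharp), apce(noisy))
```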
Figure 6 shows the APCE values and response maps of the tracking results for different
statuses. The APCE values are high under normal conditions, and the response map of the
tracking result exhibits a stiff single peak form. However, when the target is affected by
a cluttered background, the APCE value decreases, and the response map of the tracking
result begins to show a trend of multiple peaks. When the target is occluded, the value
of APCE decreases significantly, and the response map shows a low multipeak form. Our
proposed DTUN is able to adapt to tracking situations where the impact on the target
is small. However, when the target suffers from more serious impacts (e.g., occlusion),
it will lead to tracking drift and further to complete failure in subsequent tracking if it
cannot be effectively addressed. With the APCE metric, we can better measure the tracking
results and determine the extent to which the target is affected by the external environment.
This provides a more effective a priori guide for us to identify the tracking status in the
block-prediction module.
Size Change Ratio: During the target-tracking process, the size of the target typically
changes continuously, without any significant changes between consecutive frames. If
the size of the target changes significantly in consecutive frames, external interference
has likely caused the tracker to drift. Smaller influences are usually insufficient to cause
tracking drift, and only severe influences (e.g., target occlusion) can cause tracking drift. To
identify tracking drift in the block-prediction module, we introduce the size change ratio as
an indicator. The size change ratio is expressed as:
$SCR = \frac{\frac{2}{m}\sum_{i=\lfloor m/2 \rfloor}^{m} F_{w \times h}^{i}}{F_{w \times h}^{c}}$  (11)

where m denotes the number of templates in the template memory, $F_{w \times h}^{i}$ denotes the size of the i-th target template in the template memory, and $F_{w \times h}^{c}$ denotes the template size of the current frame.
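The sketch below computes the SCR of Equation (11) from the stored template areas; the example only shows that a sudden size change drives the ratio far from 1, and the exact thresholding logic of the status recognizer is not reproduced here.

```python
import numpy as np


def size_change_ratio(memory_sizes, current_size):
    """Equation (11): ratio of the recent historical target area to the current area.

    `memory_sizes` lists the target areas (w * h) of the templates in the
    memory, oldest first; `current_size` is the area of the current frame's
    box. Only the newer half of the memory is averaged, as described above.
    """
    m = len(memory_sizes)
    recent = memory_sizes[m // 2:]                 # the latter m/2 entries
    hist = float(np.mean(recent)) if recent else current_size
    return hist / max(current_size, 1e-6)


# Example: the tracked box suddenly shrinks to a third of its usual area,
# pushing the ratio well away from 1 (a candidate tracking-drift signal).
print(size_change_ratio([1200, 1180, 1220, 1210], current_size=400))  # ~3.0
```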

(a) (b) (c)

Figure 6. The heatmaps and APCE values of response maps for different tracking statuses. (a) Normal
situation, (b) background clutter, and (c) object occlusion.

Specifically, in order to obtain a more accurate SCR value, target sizes of poor quality
are discarded. The average of the latter m/2 target sizes in the template memory is used
to calculate the historical target size. Then, the ratio of the historical target size relative to
the current target size is calculated. If this ratio falls within the threshold, the target size
is considered to be within the normal range of variation. Otherwise, a sudden change in
size is considered to have occurred. Such sudden size changes are often caused by tracking
drift due to severe impacts (e.g., occlusion) on the target. Therefore, we use the SCR as a
tracking status identification metric to assess whether the tracker is experiencing tracking
drift. Considering that the rapid movement of the sensor may also cause the target size to
change rapidly in a short period, which may lead to the misjudgment of the tracker, so we
add the tracking score and the APCE score to support the judgment. Tracking drift is only
considered to have occurred when all three conditions meet the threshold, which is then
transmitted to the condition recognizer. This auxiliary judgment prevents the misjudgment
of the tracker when the tracking condition is good and effectively improves the recognition
accuracy of tracking drift.
Figure 7 displays the SCR variation curve during target tracking. The blue curve
and bounding box represent the baseline algorithm, while the red curve and bounding
box represent our method. Tracking drift occurs, and the size of the target bounding box
gradually increases after the target is occluded starting from frame 1837. The blue curve
and target bounding box indicate that the baseline algorithm has entered a tracking drift
state. At frame 2148, when the SCR exceeds the threshold, it prompts our method to utilize
the block prediction to relocate the target. As a result, the target is successfully tracked at
frame 2343. This demonstrates the significance of the SCR metric in determining tracking
drift and the effectiveness of our method in resolving such situations.

Figure 7. SCR curve diagram.

Tracking score: The tracking score is a direct measure of the quality of tracking results.
In Siamese networks, the tracking score is calculated by multiplying the classification confi-
dence and center-ness score. The classification confidence reflects the similarity between
the template and the search region, while the center-ness reflects the distance between the
target and the center point in the search image. The classification confidence scls can be
multiplied by the center-ness sctr to suppress the score for positively classified targets that
are far from the target center. The tracking score strc can be expressed as: strc = scls × sctr .
To suppress large variations in the target scale, the scale penalty function is used to penalize
large-scale variations in the target, and the process can be written as:
$s^* = s_{trc} \times p_n = s_{trc} \times e^{k \times \max\left(\frac{r}{r'}, \frac{r'}{r}\right) \times \max\left(\frac{s}{s'}, \frac{s'}{s}\right)}$  (12)

where k is a hyperparameter, r represents the proposal's ratio of height and width, and r′ represents that of the last frame. s and s′ represent the overall scale of the proposal and the last frame, respectively. s is calculated by:

$(w + p) \times (h + p) = s^2$  (13)

where w and h denote the width and height of the target, respectively, and p is the padding
value. Additionally, to suppress scores away from the center, the response map is post-
processed using a cosine window function. The final tracking score TS can be expressed as:

$TS = \max\left(s^* \times (1 - d) + H \times d\right)$  (14)

where d is a hyperparameter, and H denotes the cosine matrix.
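A compact NumPy sketch of Equations (12)–(14) is shown below; the window-influence weight d and the use of a Hanning window as the cosine matrix H are illustrative choices, not values reported in the paper.

```python
import numpy as np


def tracking_score(s_cls, s_ctr, penalty, d=0.3):
    """Equations (12)-(14): fuse classification, center-ness, the scale penalty
    and a cosine window into the final tracking score TS.

    All arrays share the response-map shape; `penalty` is the scale-penalty
    map p_n of Equation (12) and `d` is an illustrative window weight.
    """
    s_trc = s_cls * s_ctr                              # raw tracking score
    s_star = s_trc * penalty                           # scale-penalised score
    h, w = s_star.shape
    window = np.outer(np.hanning(h), np.hanning(w))    # cosine window H
    return float(np.max(s_star * (1.0 - d) + window * d))


score = tracking_score(np.random.rand(5, 5), np.random.rand(5, 5), np.ones((5, 5)))
print(score)
```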


Finally, a 5 × 5 tracking-score response map is obtained. The location with the maximum
confidence score in the response map indicates where the target is most likely to be
present. If the target is affected by external factors, the maximum confidence score will
be lower. The tracker records tracking results with confidence scores below the threshold
and sends this information to the condition recognizer. The condition recognizer receives
three evaluation scores from the evaluation network and performs further processing in
the block-prediction module.
Algorithm 1 shows the details of the DTUN.

Algorithm 1: Dynamic template-updating network.

Input: The first frame image x and the initial template z;
Output: The predicted target's bounding box $(x_i, y_i, w_i, h_i)$ in the i-th frame;
1  Crop the initial template z and search image x;
2  for i = 1, 2, . . . , n do
3      Extract feature $\varphi_i(z)$ of template $z_i$ and feature $\varphi_i(x)$ of image $x_i$, respectively;
4      Calculate the classification $s_{cls}$, regression $s_{reg}$ and center-ness response map $s_{ctr}$ using Equation (5);
5      Calculate TS using Equations (12)–(14);
6      if TS > threshold (0.6) then
7          Store the i-th frame tracking region in the feature memory;
8      end
9      Calculate the i-th frame APCE score using Equation (10);
10     Based on the received APCE feedback, calculate the number T of templates for the (i + 1)-th frame using Equation (1);
11     if T == 1 then
12         $T_{con} = T_1$;
13     end
14     else
15         Extract high-quality and diverse templates from the feature memory using Equations (2)–(4);
16     end
17 end

3.4. Block-Prediction Module


The block-prediction module comprises three parts: the condition recognizer, the
Kalman filter, and block search. Its purpose is to address the tracking drift problem that
arises from target occlusion and from targets moving out of the field of view. These
problems are difficult to solve with traditional trackers. If the target is completely occluded
and then reappears, tracking drift may occur, preventing the target from appearing in the
tracker’s search area and resulting in tracking failure. Similarly, if the target moves out
of view and then reappears, the tracker may not be able to locate the target in the search
area. These challenging situations can cause tracking drift and subsequent tracking failures.
Therefore, it is of great significance to solve the tracking drift problem in long-term tracking.
In our analysis of the search–evaluation network, we have determined the significance
of the APCE, SCR, and TS in identifying tracking drift. The status recognizer identifies
the tracking status for each frame during the tracking process. The Kalman filter is used
to predict the target trajectory in the subsequent frame if the target is occluded. The
block search is used to re-locate the target if the predictor fails to predict the target within
20 frames. The block search network has two modes: block search and expanded block
search. If the block search fails to locate the target, the expanded block search will be used.
Additionally, we check if the target moves out of the field of view during each frame’s
tracking process. If the target moves out of the field of view, the expanded block search
will be used to relocate it.

3.4.1. Motion Prediction


In the field of target tracking, most trackers do not include a target prediction process.
This can result in tracking drift when the target is disturbed by external factors, such as
occlusion. This is particularly problematic for long-term tracking, as tracking drift can lead
to complete failure of subsequent tracking. Therefore, it is crucial to introduce a target
prediction process to address tracking drift caused by target occlusion. In airborne tracking,
due to the characteristics of distant sensors and small targets, the motion of the
target before it is occluded can be regarded as approximately linear.
The Kalman filter [6] is used in the target prediction stage. The state and observation
equations of the Kalman filter can be expressed as:

$x_k = Ax_{k-1} + w_{k-1}$  (15)

$z_k = Hx_k + v_k$  (16)
where k denotes the moment of the kth frame, xk is the state vector, and zk is the observation
vector. H is the observation matrix, and H = I, where I is the unit matrix. A is the state
transition matrix. wk−1 and vk are the process error and observation error, respectively, and
are assumed to be subject to Gaussian distributions with covariance matrices Q and R. We
set Q = I × 0.1 and R = I. During the process of object tracking, the updating process
takes up most of the time, while the prediction process takes up very little time. Therefore,
it can be assumed that the size of the target will not change significantly in a short period.
The state space of the target is set as follows:

$X = [x, y, w, h, v_x, v_y]^T$  (17)

where x and y denote the center coordinates of the target, and w and h denote the width and
height of the target, respectively. $v_x$ and $v_y$ denote the rates of change of the center coordinates
of the target, respectively.
The Kalman filtering computational steps consist of the prediction process and the
update process. The prediction process focuses on predicting the state and error covariance
variables. It can be expressed as:
$\hat{x}_k^- = A\hat{x}_{k-1}$  (18)

$\hat{p}_k^- = A\hat{p}_{k-1}A^T + Q$  (19)

where $\hat{p}_{k-1}$ is the error covariance of the prediction for the (k − 1)-th frame. The state update phase includes the optimal estimation of the system state and the update of the error covariance matrix. It can be expressed as:

$\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$  (20)

$p_k = (I - K_k H)p_k^-$  (21)

$K_k = p_k^- H^T (H p_k^- H^T + R)^{-1}$  (22)

where $K_k$ is the Kalman filter gain of the k-th frame, and $z_k$ is the actual observation of the k-th frame.
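The constant-velocity filter of Equations (15)–(22) can be sketched as follows; observing the full six-dimensional state with H = I follows the simplification stated above, while dt and the initial covariance are illustrative choices.

```python
import numpy as np


class ConstantVelocityKF:
    """Sketch of the constant-velocity Kalman filter of Equations (15)-(22).

    State X = [x, y, w, h, vx, vy]; the full state is treated as observable
    (H = I) for brevity, matching the simplification in the text.
    """

    def __init__(self, dt: float = 1.0):
        self.A = np.eye(6)
        self.A[0, 4] = self.A[1, 5] = dt   # x += vx*dt, y += vy*dt
        self.H = np.eye(6)
        self.Q = np.eye(6) * 0.1           # process noise (paper: Q = 0.1 * I)
        self.R = np.eye(6)                 # observation noise (paper: R = I)
        self.x = np.zeros(6)
        self.P = np.eye(6)

    def predict(self):
        self.x = self.A @ self.x                              # Eq. (18)
        self.P = self.A @ self.P @ self.A.T + self.Q          # Eq. (19)
        return self.x[:4]                                     # predicted box

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)              # Eq. (22)
        self.x = self.x + K @ (z - self.H @ self.x)           # Eq. (20)
        self.P = (np.eye(6) - K @ self.H) @ self.P            # Eq. (21)
```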

3.4.2. Block Search


The Kalman filter can effectively solve most target occlusion problems with near-
linear motion during target prediction. However, the motion trajectory of the target after
occlusion may sometimes exhibit significant nonlinearity. To deal with this situation, we
first predict the approximate motion direction of the target using the Kalman filter, and
then perform a block search to further locate the target. In target tracking from the UAV
perspective, the target is relatively small and moves slowly. Therefore, if the Kalman filter
fails to locate the target within 20 frames, we will use block search to further search for it.
Figure 8 illustrates the process of block search. Figure ➀ shows the process of target
prediction using Kalman filtering after the target is occluded. When the target is occluded,
its motion trajectory exhibits a large nonlinearity, which differs from the predicted trajectory,
resulting in prediction failure. Therefore, the block search is necessary for further target
searching. In Figure ➁, the red dot indicates the predicted position of the Kalman filter, and
the green rectangular box indicates the original search area. The target is not within the
search area due to the small size of the search area. Therefore, in the block search module,
the search area will be expanded first. The process can be expressed as follows:
$s_w = \sqrt{\frac{t_w^{i-1}}{b_w}} \times 289 \times n, \qquad s_h = \sqrt{\frac{t_h^{i-1}}{b_h}} \times 289 \times n$  (23)

where $s_w$ and $s_h$ represent the width and height of the search area, respectively, and n is the magnification factor. In block search, the value of n is 3 when tracking drift occurs because the object moves out of view, and 2 when tracking drift occurs in other situations. $b_w$ and $b_h$ are the width and height of the initially sampled image, obtained as follows:

$b_w = \frac{t_w^1}{t_s}, \qquad b_h = \frac{t_h^1}{t_s}$  (24)

$t_s = \frac{\sqrt{t_w^1 \times t_h^1 \times p^2}}{289}$  (25)

where $t_w^1$ and $t_h^1$ are the width and height of the initial target, respectively. p is the search
region factor, typically set to 4. Due to the enlarged search region being larger than the
target, accurately localizing the target can be difficult if searching for it directly within the
region. Figure ➂ shows the response map results of searching in this way. As the response
map shows, this direct search makes target localization difficult. Therefore, as shown in
Figure ➃, we segment the enlarged search area into 3 × 3 blocks and search for the target in
each block. Block search effectively overcomes the challenge of accurately localizing small
targets in a larger search area. Figure ➄ shows the response map result of the block search,
which accurately localizes the target in the block image containing it. The target region has
the highest response value score, while the other blocks have significantly lower scores.
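A minimal sketch of this block search is given below; `score_fn` stands in for the tracking network that returns a TS value and a response peak for one block, and the 3 × 3 grid follows the description above.

```python
import numpy as np


def block_search(search_region, score_fn, grid: int = 3):
    """Sketch of the block search: split the enlarged region into grid x grid
    blocks, score each block with the tracker, and return the best block.

    `search_region` is an H x W x 3 crop sized per Equation (23); `score_fn`
    is a placeholder for the network returning (TS, (x', y')) for one block.
    """
    h, w = search_region.shape[:2]
    bh, bw = h // grid, w // grid
    best = (-1.0, None, None)              # (score, (c_h, c_w), (x', y'))
    for ch in range(grid):
        for cw in range(grid):
            block = search_region[ch * bh:(ch + 1) * bh, cw * bw:(cw + 1) * bw]
            ts, peak = score_fn(block)
            if ts > best[0]:
                best = (ts, (ch, cw), peak)
    return best


# Example with a dummy scorer that simply prefers the brightest block.
region = np.zeros((270, 270, 3)); region[95:110, 185:200] = 1.0
dummy = lambda b: (float(b.mean()), (0, 0))
print(block_search(region, dummy)[1])  # -> (1, 2): the block containing the target
```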
[Figure 8 diagram, panels ①–⑥: ① target prediction with the Kalman filter (predicted versus actual motion trajectory), ② expanded search image, ③ expanded-region response map, ④ block image (3 × 3 grid), ⑤ block response map, and ⑥ target location recovery.]

Figure 8. Block search module.

Figure ➅ illustrates the calculation of the target center position. The center coordinates
( x, y) of the target in the original search image are calculated as follows:

$x = \left(c_w - \left\lfloor \frac{\sqrt{C}}{2} \right\rfloor\right) \times S + x' - x_1 + x_0, \qquad y = \left(c_h - \left\lfloor \frac{\sqrt{C}}{2} \right\rfloor\right) \times S + y' - y_1 + y_0$  (26)

where C represents the number of block subimages. $(c_w, c_h)$ represents the coordinate position of the block subimage where the target is located, with the horizontal and vertical axes being represented by w and h, respectively. S denotes the width and height of the square block subimage. $(x_0, y_0)$ represents the coordinates of the center position of the search area, $x_0 = \frac{s_w}{2}$, $y_0 = \frac{s_h}{2}$. $(x_1, y_1)$ represents the coordinates of the center position of the block subimage where the maximum score of the target is located, $x_1 = y_1 = \frac{S}{2}$. $(x', y')$ represents the coordinates of the position with the maximum score in the block search
response map. The width w and height h of the target bounding box are calculated using
the following equations:
$w = (1 - r)w_{i-1} + r \times w_p, \qquad h = (1 - r)h_{i-1} + r \times h_p$  (27)

where $w_p$ and $h_p$ denote the width and height of the predicted target, respectively, and $w_{i-1}$ and $h_{i-1}$ denote the width and height of the target in the previous frame, respectively. r is derived by:

$r = p_n(\arg\max\{s_{trc}\}) \times \max\{s_{trc}\} \times q$  (28)

where q is a hyperparameter, $s_{trc}$ is the tracking score, and $p_n$ represents the scale penalty function.
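The coordinate recovery of Equations (26) and (27) can be sketched as follows; the function and parameter names are illustrative, and the smoothing weight r is assumed to be supplied by Equation (28).

```python
import math


def recover_center(block_idx, peak, S, sw, sh, num_blocks):
    """Equation (26): map a peak inside a block back to search-region coordinates.

    `block_idx` = (c_w, c_h) is the winning block, `peak` = (x', y') the peak
    inside it, `S` the block side length, (sw, sh) the enlarged search-region
    size, and `num_blocks` = C (9 for a 3x3 grid, 25 for 5x5).
    """
    cw, ch = block_idx
    xp, yp = peak
    x0, y0 = sw / 2.0, sh / 2.0            # centre of the search region
    x1 = y1 = S / 2.0                      # centre of a block sub-image
    half_grid = math.floor(math.sqrt(num_blocks) / 2)
    x = (cw - half_grid) * S + xp - x1 + x0
    y = (ch - half_grid) * S + yp - y1 + y0
    return x, y


def smooth_size(w_prev, h_prev, w_pred, h_pred, r):
    """Equation (27): exponentially smooth the box size with weight r from Eq. (28)."""
    return (1 - r) * w_prev + r * w_pred, (1 - r) * h_prev + r * h_pred


# A peak at the centre of the centre block maps back to the region centre.
print(recover_center((1, 1), (45, 45), S=90, sw=270, sh=270, num_blocks=9))  # (135.0, 135.0)
```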
Tracking drift can also be caused by targets moving out of view, which makes it difficult
for the tracker to localize the target due to the uncertainty in the center of the search area.
To address this issue, the enlarged block search method is utilized. To bring the search area
closer to the target’s reappearance, we calculate the average of the target’s historical trajectory
coordinates and use it as the center of the new search area. The search area is expanded to
three times the original size, and a 5 × 5 block search is used to search the target.
When the target moves out of the field of view, the tracker is unable to track the target
correctly. During this time, the tracker will perform expanded block search every 20 frames.
(It should be noted that the Kalman filter is not effective in acquiring the target when it
moves out of view or reappears. In this case, only block search will be employed, and
the Kalman filter will not be used.) It is not until the target reappears within the field of
view (e.g., the target changes its direction of motion or the UAV turns its camera towards
the target) that the expanded block search can correctly localize the target. Given the
uncertainty surrounding the moment of target reappearance, the expanded block search is
conducted every 20 frames, which avoids repeated search calculations and improves the
tracking efficiency.
Figure 9 shows the overlap rate graphs of the proposed method compared to the
baseline algorithm in the UAV20L long-term tracking dataset. The overlap ratio is defined
as the intersection-over-union ratio between the predicted target bounding box and the
ground truth. A low overlap ratio in the graph indicates tracking drift, which may be
caused by occlusion or the disappearance of the target. Tracking drift leads to an almost
complete failure of subsequent tracking, as shown by the blue curve in the figure. Our
method (red curve) effectively re-tracks the target after tracking drift, demonstrating its
effectiveness in solving the tracking drift problem.

(a) group2 (b) group3


(c) person7 (d) person14

Figure 9. Comparison results of overlap rates on the UAV20L long-term tracking dataset. The
red curve represents the method using motion prediction and block search, while the blue curve
represents the baseline method. The intermittent blank areas in the figure indicate cases where the
target disappears, resulting in no overlap rate values.

Algorithm 2 shows the complete details of the MPBTrack.

Algorithm 2: The proposed MPBTrack algorithm.


Input: The first frame image x and the initial template z;
Output: The predicted bounding box (xi , yi , wi , hi ) of the target in the ith frame;
1 Crop the initial template region and the first frame search image;
2 for i = 1, 2, . . . , n do
3 Extract feature φi (z) of template zi and feature φi ( x ) of image xi , respectively;
4 Feed concat( φ1 (z), . . . , φi (z)) and φi ( x ) into the search network;
5 Output the classification response map scls , regression response map sreg and
center-ness response map sctr ;
6 Feed scls , sreg and sctr into evaluation network;
7 Output the scores APCE, SCR and TS;
8 Recognize tracking status by using joint indicators APCE, SCR and TS;
9 if Tracking drift occurs then
10 Predicting the motion of the target using Equations (18) and (19);
11 if Failed to predict within 20 frames then
12 Calculate the search region size using Equations (23)–(25);
13 Segment the search region into m × m blocks;
14 for block subimage = 1,2, . . . , m2 do
15 Calculate the TS for each block;
16 end
17 Determine the block subimage where the maxTS is located;
18 Calculate the center coordinates (x, y), width w and height h of the target using Equations (26) and (27);
19 end
20 end
21 else if Target moved out of view then
22 Execute steps 10–18;
23 end
24 else
25 Update target state using Equations (20)–(22);
26 end
27 end

4. Experiments
This section presents the experimental validation of the MPBTrack algorithm. Section 4.1
describes the details of model training and experimental evaluation. In Section 4.2, we per-
form ablation experiments for analysis. Sections 4.3–4.7 present quantitative experimental
results, and Section 4.8 presents a qualitative experimental analysis.

4.1. Experimental Details


The experiments were conducted on an Intel(R) Xeon(R) Silver 4110 2.10 GHz CPU
and NVIDIA GeForce RTX2080 GPU platform. We use GoogLeNet [31] as the backbone.
The algorithmic model is trained on the TrackingNet [33], LaSOT [34], and GOT-10k [35]
training sets, as well as the ILSVRC VID [36], ILSVRC DET [36], and COCO [37] datasets.
The model was trained with 20 epochs using the SGD optimizer. The learning rate increased
from 1 × 10−2 to 8 × 10−2 with a warmup technology in the first epoch and then decreased
from 8 × 10−2 to 1 × 10−6 with a cosine annealing learning rate schedule. The momentum
and weight decay rate were set to 0.9 and 1 × 10−4 , respectively. During the inference
phase, the block subimages were uniformly cropped to 289 × 289 to be fed into the feature
extraction network. The block search was performed every 20 frames after tracking drift,
while the expanded block search was performed every 15 frames.
To demonstrate the effectiveness of our approach in tracking an object on UAV plat-
forms, quantitative and qualitative experiments were conducted on five aerial object track-
ing datasets: UAV20L [15], UAV123 [15], UAVDT [16], DTB70 [17], and VisDrone2018-
SOT [18]. The evaluation of the experimental results included overall performance as-
sessment and evaluation of the challenge attributes. The metrics used to evaluate the
experiment were success and precision. Success is expressed as the area under curve (AUC)
plotted as the overlap at different thresholds. The overlap is the ratio of the intersection
over union (IoU) between the predicted bounding box region A and the ground truth
| A∩ B|
region B. It can be expressed as Overlap = | A∪ B| , where |.| denotes the area of the region.
Due to the potential for inconsistent evaluation results with different overlap thresholds,
AUC was used to rank the success scores of the trackers. Precision measures the distance
between the center position of the predicted bounding box and the center position of the
ground truth. Traditionally, this distance threshold is set to 20 pixels.
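For clarity, the following sketch shows how the overlap, success (AUC), and precision metrics can be computed from predicted and ground truth boxes; it assumes axis-aligned boxes in (x, y, w, h) format and follows the standard one-pass evaluation definitions rather than any benchmark-specific toolkit.

    import numpy as np

    def iou(box_a, box_b):
        """Intersection over union of two (x, y, w, h) boxes."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
        """Area under the success curve: mean fraction of frames whose IoU exceeds each threshold."""
        overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
        return np.mean([(overlaps > t).mean() for t in thresholds])

    def precision_at(pred_boxes, gt_boxes, dist_threshold=20.0):
        """Fraction of frames whose center location error is within dist_threshold pixels."""
        def center(b):
            x, y, w, h = b
            return np.array([x + w / 2.0, y + h / 2.0])
        errors = np.array([np.linalg.norm(center(p) - center(g))
                           for p, g in zip(pred_boxes, gt_boxes)])
        return (errors <= dist_threshold).mean()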

4.2. Ablation Experiments


To validate the effectiveness of the proposed modules, we performed ablation exper-
iments on the UAV123 and UAV20L datasets. Our tracking framework comprises two
parts: the dynamic template-updating network (DTUN) and the block-prediction module
(BPM). We added DTUN, BPM, and DTUN+BPM to the baseline algorithm to validate the
effectiveness of each module separately.
Table 1 presents the experimental results for the UAV20L and UAV123 datasets. Our
method demonstrates a 16.5% and 18.9% improvement in success and precision, respec-
tively, compared to the baseline on the UAV20L dataset. Specifically, DTUN shows a 4.4%
and 5.3% improvement in success and precision, respectively, while BPM shows a 14.3%
and 16.0% improvement in success and precision, respectively. The ablation experiments
demonstrate that both the proposed DTUN and BPM achieve excellent performance im-
provement. Our method provides a significant improvement in solving the tracking drift
problem in long-term airborne object tracking.
The target in the UAV view is small and moves slowly over a short period. To avoid
reduced tracking speed, the block search should not be used for every frame when tracking
drift occurs. Instead, appropriate time intervals should be used to balance the success
rate and speed. Table 2 shows how tracking performance and speed change with different time intervals. If a smaller interval is used, tracking speed may decrease without any gain in performance; conversely, a larger interval can degrade tracking performance, and the accompanying drop in speed is likely caused by tracking drift that remains unresolved for longer. Based on this analysis, a frame interval of 20 is suitable for tracking on UAVs.

Table 1. Analysis of ablation experiments using DTUN and BPM on the UAV20L and UAV123 datasets.

Module                    UAV20L               UAV123
                          Success   Precision  Success   Precision
Baseline                  0.589     0.742      0.647     0.825
Baseline + DTUN           0.617     0.784      0.651     0.834
Baseline + BPM            0.683     0.875      0.651     0.841
Baseline + DTUN + BPM     0.706     0.905      0.656     0.842

Table 2. Ablation experiments with different search intervals.

Interval Frames    UAV20L                         VisDrone2018-SOT
                   Success   Precision   FPS      Success   Precision   FPS
10                 0.697     0.893       32.4     0.637     0.822       27.9
15                 0.700     0.898       32.6     0.637     0.823       28.3
20                 0.706     0.905       43.5     0.665     0.864       43.4
25                 0.689     0.811       32.8     0.633     0.818       26.7

Section 3.3.2 discusses the importance of the SCR metric for detecting tracking drift in UAV video object tracking, and choosing a suitable SCR threshold improves the accuracy of recognizing the tracking status. Table 3 shows the impact of different SCR thresholds on tracking performance: a threshold that is too large or too small leads to misjudging the tracking drift condition and, in turn, degrades tracking performance.

Table 3. Experimental results for various SCR thresholds on the UAV20L dataset.

SCR          2.0      2.4      2.8      3.2      3.6
Success      0.696    0.704    0.706    0.695    0.688
Precision    0.892    0.901    0.905    0.890    0.882

4.3. Experiments on UAV20L Benchmark


UAV20L [15] is a long-term object tracking dataset, which contains 20 long-term se-
quences with an average of 2933 frames per sequence. The dataset comprises approximately
58k frames, with the longest sequence consisting of 5527 frames. It presents a significant
challenge for the tracker’s long-term tracking capabilities. The challenging attributes of
UAV20L include aspect ratio change (ARC), background clutter (BC), camera motion (CM),
fast motion (FM), full occlusion (FO), illumination variation (IV), low resolution (LR), out of
view (OV), partial occlusion (PO), scale variation (SV), similar object (SO), and viewpoint
change (VC). Precision and success serve as evaluation metrics for this dataset.
Figure 10 compares our method with other competitive methods, including the Transformer-based TaMOs [38], ARTrack [39], RTS [22], SLT-TransT [40], OSTrack [26], and TransT [41], and the Siamese network-based STMTrack [9], SiamBAN [42], SiamCAR [21], and SiamGAT [10]. Our method outperforms the baseline STMTrack, improving success by 19.1% and precision by 20.8%. In terms of success, our method surpasses the second-ranked OSTrack by 2.3% while running at a faster tracking speed. These results demonstrate the superior robustness and accuracy of our method for long-term tracking on UAVs, with a tracking speed that far exceeds real-time requirements.
Figures 11 and 12 show the success and precision plots for attribute evaluation, respec-
tively. The proposed method achieves excellent performance in both attribute evaluations.
In terms of success, the scores are ARC (0.683), BC (0.613), CM (0.698), FO (0.599), IV (0.663),
OV (0.703), PO (0.689), SV (0.706), and VC (0.742); in terms of precision, the scores are ARC
(0.881), BC (0.870), CM (0.900), FO (0.857), IV (0.856), OV (0.898), PO (0.894), SV (0.900),
and VC (0.906). The attribute evaluations demonstrate the remarkable performance of
the proposed DTUN and BPM in dealing with various complex situations, proving the
effectiveness of our approach.

Figure 10. Success and precision plots on the UAV20L dataset. (a) Success; (b) Precision.
Figure 11. Success plots for attribute evaluation on the UAV20L dataset. Panels: (a) ARC, (b) BC, (c) CM, (d) FO, (e) IV, (f) OV, (g) PO, (h) SV, (i) VC.

Figure 12. Precision plots for attribute evaluation on the UAV20L dataset. Panels: (a) ARC, (b) BC, (c) CM, (d) FO, (e) IV, (f) OV, (g) PO, (h) SV, (i) VC.

4.4. Experiments on UAV123 Benchmark


UAV123 [15] consists of 123 low-altitude video sequences captured by UAVs, with a
total of more than 110K frames. This dataset poses significant challenges for object tracking
due to its inclusion of numerous video sequences with complex conditions. Table 4 outlines
the results of the comparison experiments between our method and other competitors, such
as TaMOs [38], HiFT [25], STMTrack [9], SiamPW-RBO [43], and LightTrack [44]. Compared
with the baseline method, our method delivers competitive performance, improving success
and precision by 1.4% and 2.1%, respectively.

Table 4. Experimental results comparing our method with other methods on the UAV123 dataset. Trackers are ranked based on their success scores.

Tracker             Succ.    Prec.
TaMOs [38]          0.571    0.791
HiFT [25]           0.589    0.787
TCTrack [27]        0.604    0.800
PACNet [45]         0.620    0.827
SiamCAR [21]        0.623    0.813
LightTrack [44]     0.626    0.809
CNNInMo [46]        0.629    0.818
SiamBAN [42]        0.631    0.833
SiamRN [11]         0.643    -
AutoMatch [47]      0.644    -
SiamPW-RBO [43]     0.645    -
SiamGAT [10]        0.646    0.843
STMTrack [9]        0.647    0.825
MPBTrack            0.656    0.842

4.5. Experiments on UAVDT Benchmark


UAVDT [16] is a dataset for object tracking and detection captured by UAVs, which
contains 50 video sequences with moving vehicles as its targets of interest. This dataset
encompasses various challenges such as long-term tracking (LT), large occlusion (LO),
object blur (OB), small object (SO), background clutter (BC), camera rotation (CR), object
motion (OM), camera motion (CM), illumination variation (IV), and scale variation (SV).
Figure 13 illustrates the experimental results of our method in comparison with
other methods (TaMOs [38], ARTrack [39], DropTrack [12], ROMTrack [48], TransT [41],
LightTrack [44], SiamCAR [21], and STMTrack [9]) on the UAVDT dataset. Our approach outperforms both trackers built on large Transformer models and Siamese network trackers. Compared to the baseline STMTrack, our method improves success and precision by 2.0% and 1.2%, respectively.
Figure 14 shows the success plots for attribute evaluation. Our method secures excellent
success scores across various challenging attributes, including BC (0.608), CM (0.651), IV
(0.695), LO (0.615), LT (0.767), OB (0.677), OM (0.687), SV (0.678), and SO (0.675). The
evaluation results highlight the excellent performance of our method in adapting to various
complex conditions on UAVs.

Figure 13. Success and precision plots on the UAVDT dataset. (a) Success; (b) Precision.
Figure 14. Success plots for attribute evaluation on the UAVDT dataset. Panels: (a) BC, (b) CM, (c) IV, (d) LO, (e) LT, (f) OB, (g) OM, (h) SV, (i) SO.

4.6. Experiments on DTB70 Benchmark


The DTB70 [17] dataset is a highly diverse dataset consisting of 70 videos captured by UAVs. Figure 15 shows the success and precision results of the proposed method on the DTB70 dataset. Our method achieves an excellent success score of 0.670 and outperforms the baseline method by 2.3% in both success and precision. Compared to Transformer-based methods such as TaMOs [38], ROMTrack [48], ARTrack [39], TransT [41], and OSTrack [26], our method achieves better tracking results without requiring large model parameters and computational resources. Similarly, compared to Siamese network-based methods such as STMTrack [9], SiamAPN [49], SiamGAT [10], and SiamCAR [21], our method achieves significant performance improvements in success and precision. In summary, our Siamese network-based tracker achieves a higher success score than the Transformer-based trackers and demonstrates strong competitiveness in tracking from UAVs.
Figure 15. Success and precision plots on the DTB70 dataset. (a) Success; (b) Precision.

4.7. Experiments on VisDrone2018-SOT Benchmark


VisDrone2018-SOT [18] contains 35 video sequences totaling approximately 29k frames.
Figure 16 shows the success and precision plots of several competitive trackers (e.g., TaMOs-R50 [38], ARTrack [39], ToMP-101 [50], CNNInMo [46], and STMTrack [9]) on the VisDrone2018-SOT dataset. Our method outperforms the baseline method by 5.6% and 4.9% in terms of success and precision, respectively, achieving significant performance improvements.
The success vs. speed and precision vs. speed results of the trackers are shown in Figure 17.
Compared to the Transformer architecture-based and Siamese network-based approaches,
our method not only achieves leading results in terms of success, but also achieves a
real-time speed of 43 FPS, which is a great advantage for object tracking from UAVs.

Figure 16. Success and precision plots on the VisDrone2018-SOT dataset. (a) Success; (b) Precision.
Figure 17. The success vs. speed and precision vs. speed plots on the VisDrone2018-SOT dataset. (a) Success vs. FPS; (b) Precision vs. FPS.

4.8. Qualitative Analysis


To visually compare the tracking performance of the proposed method, we perform a qualitative comparison against the ground truth and other methods (e.g., ARTrack [39], TCTrack [27], TransT [41], SiamCAR [21], and STMTrack [9]) on seven sequences from the UAV20L and VisDrone2018-SOT datasets. Figure 18 shows the sequences in the following order from top to bottom: car1, group2, person14, uav180, uav1, group1, and uav93.
(1) car1: This sequence presents two challenges: occlusion of the object and the object moving out of view. Several trackers drift after the object is occluded, and even more drift after the target moves out of the field of view. Our method, however, successfully re-acquires the object after both challenges.
(2) group2, person14, uav180: These three sequences present the challenge of object
occlusion. The visualization results demonstrate that when the object is occluded,
only our tracker successfully tracks it, while the other trackers experience tracking
drift or track the wrong object for a prolonged period in the subsequent frames.
This highlights the significant advantages of our tracker in long-term tracking and
handling challenging situations.
(3) uav1: This sequence involves the challenges of camera motion, background clutter, and fast object motion. The simultaneous interference of these three challenges causes tracking drift in multiple trackers. However, by relying on dynamic template updating and the block search, our tracker remains largely resistant to interference from these complex conditions.
(4) group1 and uav93: These two sequences present challenges with similar targets
and object occlusion. When mutual occlusion between objects occurs, other trackers
appear to track the wrong object. Our tracker can still accurately track the correct
object in this challenging scenario.
Figure 18. Qualitative evaluation results. From top to bottom, the sequences are car1, group2,
person14, uav180, uav1, group1, and uav93.

5. Conclusions
This paper proposes a visual object tracking algorithm based on motion prediction and block search, aiming to solve the tracking drift problem from the UAV perspective. Specifically, when the tracker experiences tracking drift caused by object occlusion or the object moving out of view, our approach first predicts the motion state of the object using a Kalman filter, and the proposed block search module then efficiently relocates the drifted target. In addition, to enhance the adaptability of the tracker in changing scenarios, we propose a dynamic template updating network, which selects the appropriate template strategy according to the tracking conditions and thereby improves the tracker's robustness. Finally, we introduce three evaluation metrics, APCE, SCR, and TS, which are used to identify the tracking drift status and provide prior information for object tracking in subsequent frames. Extensive experiments and comparisons with many competitive algorithms on five aerial benchmarks, namely, UAV123, UAV20L, UAVDT, DTB70, and VisDrone2018-SOT, demonstrate the effectiveness of our approach in resisting tracking drift in complex UAV viewpoint environments, while achieving a real-time speed of 43 FPS.

Author Contributions: L.S. and X.L. conceived of the idea and developed the proposed approaches.
Z.Y. advised the research. D.G. helped edit the paper. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (No.
62271193), the Aeronautical Science Foundation of China (No. 20185142003), Natural Science Foun-
dation of Henan Province, China (No. 222300420433), Science and Technology Innovative Talents
in Universities of Henan Province, China (No. 21HASTIT030), Young Backbone Teachers in Univer-
sities of Henan Province, China (No. 2020GGJS073), and Major Science and Technology Projects of
Longmen Laboratory (No. 231100220200).
Data Availability Statement: Code and data are available upon request from the authors.
Conflicts of Interest: Author Z.Y. was employed by the company Xiaomi Technology Co., Ltd. The
remaining authors declare that the research was conducted in the absence of any commercial or
financial relationships that could be construed as a potential conflict of interest.

References
1. Yeom, S. Thermal Image Tracking for Search and Rescue Missions with a Drone. Drones 2024, 8, 53.
2. Han, Y.; Yu, X.; Luan, H.; Suo, J. Event-Assisted Object Tracking on High-Speed Drones in Harsh Illumination Environment.
Drones 2024, 8, 22.
3. Chen, Q.; Liu, J.; Liu, F.; Xu, F.; Liu, C. Lightweight Spatial-Temporal Contextual Aggregation Siamese Network for Unmanned
Aerial Vehicle Tracking. Drones 2024, 8, 24.
4. Memon, S.A.; Son, H.; Kim, W.G.; Khan, A.M.; Shahzad, M.; Khan, U. Tracking Multiple Unmanned Aerial Vehicles through
Occlusion in Low-Altitude Airspace. Drones 2023, 7, 241.
5. Gao, Y.; Gan, Z.; Chen, M.; Ma, H.; Mao, X. Hybrid Dual-Scale Neural Network Model for Tracking Complex Maneuvering UAVs.
Drones 2023, 8, 3.
6. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
7. Xie, X.; Xi, J.; Yang, X.; Lu, R.; Xia, W. STFTrack: Spatio-Temporal-Focused Siamese Network for Infrared UAV Tracking. Drones
2023, 7, 296.
8. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In
Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic,
27 September–1 October 2021; pp. 3086–3092.
9. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. Stmtrack: Template-free visual tracking with space-time memory networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783.
10. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552.
11. Cheng, S.; Zhong, B.; Li, G.; Liu, X.; Tang, Z.; Li, X.; Wang, J. Learning to filter: Siamese relation network for robust tracking. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021;
pp. 4421–4431.
12. Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. Dropmae: Masked autoencoders with spatial-attention dropout for tracking
tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24
June 2023; pp. 14561–14571.
13. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf.
Process. Syst. 2022, 35, 16743–16754.
14. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of
the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022;
pp. 146–164.
15. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the Computer Vision–ECCV
2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany,
2016; pp. 445–461.
16. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark:
Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14
September 2018; pp. 370–386.
17. Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of
the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
18. Wen, L.; Zhu, P.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Liu, C.; Cheng, H.; Liu, X.; Ma, W.; et al. Visdrone-sot2018: The vision
meets drone single-object tracking challenge results. In Proceedings of the European Conference on Computer Vision (ECCV)
Workshops, Munich, Germany, 8–14 September 2018.
19. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28, 1–9.
21. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual
tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA,
18–24 June 2020; pp. 6269–6277.
22. Paul, M.; Danelljan, M.; Mayer, C.; Van Gool, L. Robust visual tracking by segmentation. In Proceedings of the European
Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 571–588.
23. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of
the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
24. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10448–10457.
25. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. Hift: Hierarchical feature transformer for aerial tracking. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15457–15466.
26. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In
Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg,
Germany, 2022; pp. 341–357.
27. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Tctrack: Temporal contexts for aerial tracking. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808.
28. Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. All-day object tracking for unmanned aerial vehicle. IEEE Trans. Mob. Comput. 2022, 22,
4515–4529.
29. Yang, J.; Gao, S.; Li, Z.; Zheng, F.; Leonardis, A. Resource-efficient RGBD aerial tracking. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13374–13383.
30. Luo, Y.; Guo, X.; Dong, M.; Yu, J. RGB-T Tracking Based on Mixed Attention. arXiv 2023, arXiv:2304.04264.
31. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June
2015; pp. 1–9.
32. Wang, M.; Liu, Y.; Huang, Z. Large margin object tracking with circulant feature maps. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4021–4029.
33. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking
in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 300–317.
34. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale
single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach,
CA, USA, 15–20 June 2019; pp. 5374–5383.
35. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans.
Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
36. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al.
Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September
2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
38. Mayer, C.; Danelljan, M.; Yang, M.H.; Ferrari, V.; Van Gool, L.; Kuznetsova, A. Beyond SOT: Tracking Multiple Generic Objects at
Once. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January
2024; pp. 6826–6836.
39. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706.
40. Kim, M.; Lee, S.; Ok, J.; Han, B.; Cho, M. Towards sequence-level training for visual tracking. In Proceedings of the European
Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 534–551.
41. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
42. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677.
43. Tang, F.; Ling, Q. Ranking-based Siamese visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8741–8750.
44. Yan, B.; Peng, H.; Wu, K.; Wang, D.; Fu, J.; Lu, H. Lighttrack: Finding lightweight neural networks for object tracking via one-shot
architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN,
USA, 20–25 June 2021; pp. 15180–15189.
45. Zhang, D.; Zheng, Z.; Jia, R.; Li, M. Visual tracking via hierarchical deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3315–3323.
46. Guo, M.; Zhang, Z.; Fan, H.; Jing, L.; Lyu, Y.; Li, B.; Hu, W. Learning target-aware representation for visual tracking via
informative interactions. arXiv 2022, arXiv:2201.02526.
47. Zhang, Z.; Liu, Y.; Wang, X.; Li, B.; Hu, W. Learn to match: Automatic matching network design for visual tracking. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021;
pp. 13339–13348.
48. Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9589–9600.
49. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Siamese anchor proposal network for high-speed aerial tracking. In Proceedings of the 2021
IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 510–516.
50. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022;
pp. 8731–8740.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
