Article
Visual Object Tracking Based on the Motion Prediction and
Block Search in UAV Videos
Lifan Sun 1,2,3, *, Xinxiang Li 1 , Zhe Yang 4 and Dan Gao 1
1 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China;
lxx@stu.haust.edu.cn (X.L.); d.gao@haust.edu.cn (D.G.)
2 Longmen Laboratory, Luoyang 471000, China
3 Henan Academy of Sciences, Zhengzhou 450046, China
4 Xiaomi Technology Co., Ltd., Beijing 100102, China; yangzhe11@xiaomi.com
* Correspondence: lifan.sun@haust.edu.cn
Abstract: With the development of computer vision and Unmanned Aerial Vehicles (UAVs) tech-
nology, visual object tracking has become an indispensable core technology for UAVs, and it has
been widely used in both civil and military fields. Visual object tracking from the UAV perspective
experiences interference from various complex conditions such as background clutter, occlusion, and
being out of view, which can easily lead to tracking drift. Once tracking drift occurs, it will lead
to almost complete failure of the subsequent tracking. Currently, few trackers have been designed
to solve the tracking drift problem. Thus, this paper proposes a tracking algorithm based on mo-
tion prediction and block search to address the tracking drift problem caused by various complex
conditions. Specifically, when the tracker experiences tracking drift, we first use a Kalman filter
to predict the motion state of the target, and then use a block search module to relocate the target.
In addition, to improve the tracker’s ability to adapt to changes in the target’s appearance and the
environment, we propose a dynamic template updating network (DTUN) that allows the tracker to
make appropriate template decisions based on various tracking conditions. We also introduce three
tracking evaluation metrics: namely, average peak correlation energy, size change ratio, and tracking
score. They serve as prior information for tracking status identification in the DTUN and the block
prediction module. Extensive experiments and comparisons with many competitive algorithms on
five aerial benchmarks, UAV20L, UAV123, UAVDT, DTB70, and VisDrone2018-SOT, demonstrate that our method achieves significant performance improvements. Especially in UAV20L long-term tracking, our method outperforms the baseline in terms of success rate and accuracy by 19.1% and 20.8%, respectively. This demonstrates the superior performance of our method in the task of long-term tracking from the UAV perspective, and we achieve a real-time speed of 43 FPS.
need to explore and solve the problems faced by object tracking technology from the
UAV perspective.
Traditional object-tracking algorithms usually use filtering methods (e.g., Kalman
filter [6]) to estimate the motion state of the object, and predict the position of the object
by establishing state equations and observation equations. However, these algorithms
are prone to the problem of tracking drift when the object undergoes sudden changes in
motion or when the sensor wobbles. In recent years, with the development of deep learning
(DL) technology, researchers have begun to apply DL techniques to visual object tracking
and have made significant progress. Due to their excellent performance, Siamese network
trackers [7,8] have been favored by many researchers. In visual object tracking, a Siamese network is a deep neural network that learns target features and computes similarities between objects. It consists of two structurally identical convolutional neural networks (CNNs) that process the template image and the search image, respectively, and output their feature representations.
Some derived methods such as STMTrack [9], SiamGAT [10], and SiamRN [11] are based on
Siamese networks. These methods usually improve the training method of the algorithm
and the similarity calculation process but do not address the tracking drift problem caused
by disturbances such as object occlusion and being out of view. As a result, these methods
are not advantageous for long-term tracking tasks in complex scenes.
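As a rough illustration of this template–search correlation idea, a minimal PyTorch sketch with a toy backbone and placeholder crop sizes (not the implementation of any tracker cited above) might look as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniSiamese(nn.Module):
    """Toy Siamese tracker: a shared backbone plus a cross-correlation response map."""
    def __init__(self):
        super().__init__()
        # Tiny shared feature extractor (placeholder for an AlexNet/ResNet-style backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
        )

    def forward(self, template, search):
        z = self.backbone(template)   # template features, e.g. (1, 64, 31, 31)
        x = self.backbone(search)     # search features,   e.g. (1, 64, 63, 63)
        # Cross-correlation: slide the template features over the search features.
        return F.conv2d(x, z)         # response map whose peak indicates the target position

net = MiniSiamese()
template = torch.randn(1, 3, 127, 127)   # exemplar crop (placeholder size)
search = torch.randn(1, 3, 255, 255)     # search-region crop (placeholder size)
print(net(template, search).shape)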
With the improvement of computing resources and the success of the Transformer
architecture in the field of natural language processing, target tracking algorithms based
on Transformer architecture have become a hot research topic, such as DropTrack [12],
SwinTrack [13], and AiATrack [14]. In visual object tracking, tracking algorithms based
on the Transformer architecture mainly consist of an encoder and a decoder. The encoder and decoder each consist of multiple identical layers, each containing a self-attention mechanism and a feedforward neural network. The self-attention mechanism models associations between different positions in the input sequence. Feedforward neural networks
introduce nonlinear transformations into the feature representation. SwinTrack [13] uses
the Transformer architecture for both representation learning and feature fusion. The
Transformer architecture allows for better feature interaction for tracking than the pure
CNN framework. AiATrack [14] proposes an attention-in-attention (AiA) module, which
enhances appropriate correlations by seeking consensus among all correlation vectors.
The AiA module can be applied to both self-attention blocks and cross-attention blocks to
facilitate feature aggregation. The attention mechanisms and feedforward networks of the Transformer architecture give these trackers better robustness and generalization in the tracking task. However, for the same reason, trackers based on the Transformer architecture are more computationally intensive, which limits their deployment for tracking tasks on UAV platforms.
In order to solve the tracking drift problem caused by complex conditions such as
occlusion and being out of view, this paper proposes a visual object tracking algorithm
based on motion prediction and block search and improves it for the tracking task from
the UAV perspective. We introduce three metrics for evaluating tracking results: average
peak correlation energy (APCE), size change ratio (SCR), and tracking score (TS). These
metrics aim to jointly identify the tracking status and provide prior information for the
proposed dynamic template updating network (DTUN). The proposed DTUN employs
the optimal template strategy based on different tracking statuses to improve the tracker’s
robustness. We utilize Kalman filtering to predict the motion of the object when the object is temporarily occluded or subject to other disturbances that cause tracking failure. In cases of near-linear motion, the Kalman filter can be applied simply and efficiently. For cases of nonlinear object motion, the Kalman filter predicts the approximate motion direction of the object, which is crucial for the block-search module. The proposed
block-prediction module mainly solves the long-term tracking drift problem. Considering
cases of tiny objects from the UAV perspective, we process the enlarged search region
with block search, significantly improving search efficiency for tiny objects. Overall, our
method performs well in solving the tracking drift problem in long-term tracking caused
by complex statuses such as object occlusion and being out of view. Figure 1 shows a
visualization of our method compared with the baseline algorithm. When tracking drift
occurs due to object occlusion, it will continue if no action is taken (as shown by the blue
bounding box). In contrast, our method employs a target prediction and block search
framework, which can effectively relocate the tracker to the object (as shown by the red
bounding box).
Figure 1. Visualization results of the proposed MPBTrack method compared with the baseline on
the UAV123 dataset. The first row shows the bounding box of the object, where green, red, and blue
colors indicate the ground truth, our method and the baseline method, respectively. The second and
third rows display the response map results for our method and the baseline method, respectively.
(a) represents the tracking drift condition caused by the object being out of view. (b,c) represent the
tracking drift conditions caused by the occlusion. (d) represents the tracking error caused by the
occlusion and the interference from a similar object.
Extensive experiments on five aerial datasets, UAV20L [15], UAV123 [15], UAVDT [16],
DTB70 [17], and VisDrone2018-SOT [18], demonstrate the superior performance of our
method. In particular, on the large-scale long-term tracking dataset UAV20L, our method
achieves a significant improvement of 19.1% and 20.8% in terms of success and accuracy,
respectively, compared to the baseline method. In addition, our method achieves a tracking
speed of 43 FPS, which far exceeds the real-time requirement. This demonstrates the high-
efficiency and high-accuracy performance of our method in performing the object tracking
task from the UAV perspective. Figure 2 shows the results of our method compared with
other methods in terms of success rate and tracking speed.
In summary, the main contributions of this paper can be summarized in the following
three aspects:
(1) We introduce three evaluation metrics: the APCE, SCR, and TS, which are used to
evaluate the tracking results for each frame. These evaluation metrics are used to
jointly identify the tracking status and provide feedback information to the DTUN in
the tracking of subsequent frames. The proposed DTUN adjusts the template strategy
Figure 2. Performance comparison of our method with others on the UAV20L long-term tracking
dataset in terms of success and speed.
This paper is organized as follows: Section 2 describes other work related to our
approach. Section 3 describes the framework and specific details of our proposed approach.
Section 4 shows the results of experiments on five aerial datasets and compares them with
other methods. Finally, our work is summarized in Section 5.
2. Related Work
Visual object tracking is an important task in computer vision that has developed
rapidly in recent years. It aims to accurately track objects in video sequences in real time. In
addition, the application of visual object tracking in the field of UAVs is receiving increasing
attention and plays an important role. In this section, research on visual object tracking algorithms and their development for UAVs is discussed.
Recently, the Transformer model has achieved great success in natural language processing and
computer vision with the improvement of computational resources. Transformer-based
tracking algorithms achieve more accurate tracking by introducing an attention mechanism
to model the relationship between the object and its surrounding context. These methods
take a sequence of object features as input and use a multilayer Transformer encoder to
learn the object’s representation.
In recent years, Siamese network-based tracking algorithms have been extensively
researched. Li et al. proposed SiamRPN [19] by introducing the regional proposal network
(RPN) [20] into the Siamese network. RPN consists of two branches, object classification
and bounding box regression. The introduction of RPN eliminates the traditional multiscale
testing and online fine-tuning process, and greatly improves the accuracy and speed of
object tracking. Guo et al. argued that anchor-based regression networks require tricky
hyperparameters to be set manually, so they proposed an anchor-free tracking algorithm
(SiamCAR) [21]. Compared to the anchor-based approach, the anchor-free regression
network has fewer hyperparameters. It directly calculates the distance from the center
of the object to the four edges during the bounding box regression process. Fu et al.
argued that the fixed initial template information has been fully mined and the existing
online learning template update process is time-consuming. Therefore, they proposed a
tracking framework based on space-time memory networks (STMTrack) [9] that utilizes
the historical templates from the tracking process to better adapt to changes in the object's
appearance. RTS [22] proposed a segmentation-centered tracking framework that can
better distinguish between object and background information. It can generate accurate
segmentation masks in the tracking results, but there is a reduction in the tracking speed.
Although the above trackers improve tracking accuracy by improving the model training
method and bounding box regression, they do not effectively solve the tracking drift
problem due to complex situations. In addition, Zheng et al. [23] argued that the imbalance
of the training data makes the learned features lack significant discriminative properties.
Therefore, in the offline training phase, they made the model more focused on semantic
interference by controlling the sampling strategy. In the inference phase, an interference-
aware module and a global search strategy are used to improve the tracker’s resistance to
interference. However, this global search strategy is not good for tracking tiny objects from
the UAV perspective, especially when there are similar objects around, background clutter,
or low resolution.
After Siamese networks, trackers based on large Transformer models have also achieved excellent results. Yan et al. [24] presented a tracking architecture with an encoder–decoder
transformer (STARK). The encoder models the global spatio-temporal feature information
of the target and the search region, and the decoder predicts the spatial location of the
target. The encoder–decoder transformer captures remote dependencies in both spatial and
temporal dimensions and does not require subsequent hyperparameter processing. Cao
et al. proposed an efficient Hierarchical Feature Transformer (HiFT) [25], which inputs hier-
archical similarity maps into the feature transformer for the interactive fusion of spatial and
semantic cues. It not only improves the global contextual information but also efficiently
learns the dependencies between multilevel features. Ye et al. [26] argued that the features
extracted by existing two-stage tracking frameworks lack target perceptibility and have
limited target–background discriminability. Therefore, they proposed a one-stage tracking
framework (OSTrack), which bridges templates and search images with bidirectional in-
formation flows to unify feature learning and relation modeling. To further improve the
inference efficiency, they proposed an in-network candidate early elimination module to
gradually discard candidates belonging to the background. The above Transformer-based trackers achieve significant performance improvements in visual object tracking, but they require substantial computational resources, which is a disadvantage for applications on UAV platforms.
3. Proposed Method
This section provides a comprehensive overview of the proposed method. Section 3.1
outlines the overall structure, Section 3.2 delves into the dynamic template updating net-
work, Section 3.3 discusses the specifics of the search–evaluation network, and Section 3.4
presents details related to the block-prediction module.
other interference, it will predict the target’s motion trajectory using a Kalman filter. If the
object’s position is not accurately predicted within 20 frames, a block search in an expanded
area is utilized to relocate the object.
Figure 3. Overall network framework of MPBTrack. “⋆” indicates cross correlation operation.
Figure 4. Dynamic template updating network structure. “+” indicates concatenation operation. The
red box in the response map represents the maximum tracking score.
The TS metric is employed to assess the quality of the tracked region for each frame. If
the TS of the current region exceeds a predefined threshold (0.6), it will be incorporated
into the feature memory as a new template, following the processes of cropping and
feature extraction. Specifically, the new template will perform the following operations
to yield the optimal template features: (1) obtain the cropping size of the current image according to Equation (23) and crop the image with the target as the center point; (2) resize the cropped region to 289 × 289 to obtain a template containing both the foreground region and the background region; (3) generate a foreground–background mask map of the same size as the template, marking its foreground and background regions; and (4) feed the cropped template and the mask map jointly into the feature extraction network to obtain the template features (this improves the tracker's discrimination between the target and the background), which are finally stored in the feature memory.
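As a rough sketch of this quality-gated memory update (generic Python; the memory capacity and eviction policy below are assumptions not specified in the text):

import numpy as np

TS_THRESHOLD = 0.6       # tracking-score gate used to accept a new template
MEMORY_CAPACITY = 200    # assumed upper bound on stored templates (not from the paper)

feature_memory = []      # chronological list of (template_feature, tracking_score)

def maybe_update_memory(template_feature: np.ndarray, tracking_score: float) -> None:
    """Append the new template feature only if the current frame was tracked reliably."""
    if tracking_score <= TS_THRESHOLD:
        return                      # low-quality frame: skip, keep the memory clean
    feature_memory.append((template_feature, tracking_score))
    if len(feature_memory) > MEMORY_CAPACITY:
        feature_memory.pop(0)       # assumed policy: drop the oldest stored template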
The APCE score measures the quality of the response map of the tracking results
(described in more detail in Section 3.3.2). The tracker receives feedback on the APCE
results at each frame and transforms the APCE values into the corresponding number of
templates. The equation for this transformation process is as follows:
T = Nmax − (Nmax − 1) / (1 + exp(−a(APCE − b)))    (1)
where T denotes the number of templates, and Nmax denotes the maximum value of the
template range. We set the template range to 1–10, and a and b denote the slope and
horizontal offset of the function, respectively. If the tracking quality of the previous frame is
high, a small number of templates will be used in the next frame to achieve better tracking
results. When the tracker receives a lower tracking quality in the previous frame, more
historical templates will be utilized to adapt to the current complex tracking situation.
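As a small numerical sketch of the mapping in Equation (1) (the slope a and offset b below are illustrative placeholders, since their actual values are not given here):

import math

def template_count(apce: float, n_max: int = 10, a: float = 0.1, b: float = 30.0) -> int:
    """Map the previous frame's APCE to the number of templates for the next frame.

    High APCE (confident tracking) -> few templates; low APCE -> more historical
    templates. a and b are illustrative values, not the paper's settings.
    """
    t = n_max - (n_max - 1) / (1.0 + math.exp(-a * (apce - b)))
    return max(1, round(t))

print(template_count(apce=80.0))   # confident frame  -> 1 template
print(template_count(apce=10.0))   # ambiguous frame -> 9 templates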
The template extraction mechanism is an important part of the dynamic template-
updating process. A large number of historical templates are stored in the template memory,
and effective utilization of these templates is crucial to the robustness and accuracy of
tracking. By using Equation (1), we can calculate the number of templates that need to
be used in the next frame, and then use the template extraction mechanism to select high-
quality and diverse templates from the historical ones, suitable for tracking in the next
frame. The template extraction can be denoted as:
t
τi = ⌊ ⌋ × i, i = 0, 1, 2, . . . , N (2)
N
Tj = maxrag τj , τj+1 , j = 0, 1, 2, . . . , N − 1 (3)
Tcon = concat( T0 , . . . , TN −1 ) (4)
where t is the total number of templates in the library, and N is the number of templates to be extracted. τj denotes the j-th segmentation point, and maxarg[φ1, φ2] denotes the template with the highest tracking score in the interval [φ1, φ2]. concat(·, ·) denotes the concatenation operation. Assuming
that N templates are needed for the next frame, the specific extraction steps are as follows:
(1) The initial template is necessary, as it contains primary information about the tracking
target. Note that we exclude the last frame template, as it may introduce additional inter-
fering information. (2) The templates in the template memory are sequentially divided into
N − 1 segments, and then the template with the highest tracking score is selected from each
segment to serve as the optimal tracking template. This extraction mechanism can enhance
the diversity of templates and extract high-quality templates, thus significantly improving
the robustness of the tracker.
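A rough sketch of this extraction mechanism follows; how the segments are bounded and whether the initial template joins the segmented history are details the text leaves open, so the handling below is an assumption:

def extract_templates(memory, n):
    """Select n diverse, high-quality templates, in the spirit of Equations (2)-(4).

    memory: chronological list of (feature, tracking_score); memory[0] is the initial
    template and memory[-1] the last frame's template (excluded to avoid contamination).
    """
    selected = [memory[0][0]]                  # step (1): always keep the initial template
    if n <= 1 or len(memory) <= 2:
        return selected
    history = memory[1:-1]                     # candidates between initial and last frame
    segments = n - 1
    step = max(1, len(history) // segments)    # segmentation points tau_i (Equation (2))
    for j in range(segments):
        segment = history[j * step:(j + 1) * step] or history[-step:]
        best = max(segment, key=lambda item: item[1])   # highest tracking score (Eq. (3))
        selected.append(best[0])
    return selected                            # caller concatenates the features (Eq. (4))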
where φ denotes the feature extraction operation, ⋆ denotes the cross correlation operation,
and concat(, ) denotes the concatenation operation. Following the response map, the
classification convolutional neural network and regression convolutional neural network
are used to obtain the classification feature maps Rcls and regression feature maps Rreg ,
respectively. The purpose of the classification branch is to classify the target to be tracked
from the background. The classification branch includes a center-ness branch that boosts
confidence for positions closer to the center of the image. Multiplying the classification
response map scls with the center-ness response map sctr suppresses the classification
confidence for locations farther from the center of the target, resulting in the final tracking
response map. The purpose of the regression branch is to determine the distance from the
center of the target to the left, top, right, and bottom edges of the target bounding box in
the search image.
Figure 5. Search network structure. “⋆” denotes the cross correlation operation. R∗ represents the
response map.
During the training phase of the network model, the classification branch is trained using the focal loss function, and the regression branch is trained using the IoU loss, which can be expressed as:
Lreg = 1 − Intersection(B, B∗) / Union(B, B∗)    (7)
where B is the predicted bounding box and B∗ is its corresponding ground-truth bounding
box. The center-ness branch uses a binary cross-entropy (BCE) loss, in which p(i,j) represents the predicted center-ness score at point (i, j) and c(i,j) denotes the corresponding label value. The final loss function is:
Loss = (1/N) ∑x,y Lcls + (λ/N) ∑x,y Lreg + (λ/N) ∑x,y Lcen    (9)
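As an illustration of the IoU-based regression term of Equation (7), which enters the total loss above (a generic sketch, not the authors' training code):

import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_reg = 1 - IoU(B, B*) for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    x1 = torch.maximum(pred[:, 0], target[:, 0])
    y1 = torch.maximum(pred[:, 1], target[:, 1])
    x2 = torch.minimum(pred[:, 2], target[:, 2])
    y2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-6)
    return (1.0 - iou).mean()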
APCE = |Fmax − Fmin|² / mean( ∑w,h (Fw,h − Fmin)² )    (10)
where Fmax and Fmin represent the maximum and minimum values of the response map of the tracking result, respectively, and Fw,h is the response value at position (w, h).
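A small sketch of the APCE computation in Equation (10) over a 2-D response map (generic NumPy, not the authors' code):

import numpy as np

def apce(response: np.ndarray) -> float:
    """Average peak-to-correlation energy of a response map (Equation (10))."""
    f_max, f_min = response.max(), response.min()
    fluctuation = np.mean((response - f_min) ** 2)
    return float((f_max - f_min) ** 2 / (fluctuation + 1e-12))

# A sharp single peak yields a high APCE; a noisy multi-peak map yields a low one.
sharp = np.zeros((17, 17)); sharp[8, 8] = 1.0
noisy = np.random.default_rng(0).random((17, 17))
print(apce(sharp), apce(noisy))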
Figure 6 shows the APCE values and response maps of the tracking results for different
statuses. The APCE values are high under normal conditions, and the response map of the
tracking result exhibits a sharp single-peak form. However, when the target is affected by
a cluttered background, the APCE value decreases, and the response map of the tracking
result begins to show a trend of multiple peaks. When the target is occluded, the value
of APCE decreases significantly, and the response map shows a low multipeak form. Our
proposed DTUN is able to adapt to tracking situations where the impact on the target
is small. However, when the target suffers from more serious impacts (e.g., occlusion),
it will lead to tracking drift and further to complete failure in subsequent tracking if it
cannot be effectively addressed. With the APCE metric, we can better measure the tracking
results and determine the extent to which the target is affected by the external environment.
This provides a more effective a priori guide for us to identify the tracking status in the
block-prediction module.
Size Change Ratio: During the target-tracking process, the size of the target typically
changes continuously, without any significant changes between consecutive frames. If
the size of the target changes significantly in consecutive frames, external interference
has likely caused the tracker to drift. Smaller influences are usually insufficient to cause
tracking drift, and only severe influences (e.g., target occlusion) can cause tracking drift. To
identify tracking drift in the block-prediction module, we introduce the size change ratio as
an indicator. The size change ratio is expressed as:
SCR = ( (2/m) ∑_{i=⌊m/2⌋}^{m} F^i_{w×h} ) / F^c_{w×h}    (11)
where m denotes the number of templates in the template memory, F^i_{w×h} denotes the size of the i-th target template in the template memory, and F^c_{w×h} denotes the template size of the current frame.
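A sketch of the SCR computation in Equation (11); the drift band used at the end is an assumed illustrative threshold (the actual choice is studied in Table 3):

def size_change_ratio(template_sizes, current_size):
    """SCR: mean of the most recent half of the stored template sizes / current size."""
    m = len(template_sizes)
    recent = template_sizes[m // 2:]                  # the latter m/2 sizes (Equation (11))
    return (sum(recent) / len(recent)) / current_size

sizes = [900, 910, 905, 920, 915, 925]                # historical w x h template areas
scr = size_change_ratio(sizes, current_size=2400)     # the box suddenly grew after occlusion
suspect_drift = scr < 0.5 or scr > 2.0                # assumed band; combined with TS and APCE
print(scr, suspect_drift)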
Figure 6. The heatmaps and APCE values of response maps for different tracking statuses. (a) Normal
situation, (b) background clutter, and (c) object occlusion.
Specifically, in order to obtain a more accurate SCR value, target sizes of poor quality
are discarded. The average of the latter m/2 target sizes in the template memory is used
to calculate the historical target size. Then, the ratio of the historical target size relative to
the current target size is calculated. If this ratio falls within the threshold, the target size
is considered to be within the normal range of variation. Otherwise, a sudden change in
size is considered to have occurred. Such sudden size changes are often caused by tracking
drift due to severe impacts (e.g., occlusion) on the target. Therefore, we use the SCR as a
tracking status identification metric to assess whether the tracker is experiencing tracking
drift. Considering that rapid sensor movement may also cause the target size to change rapidly within a short period and lead the tracker to a misjudgment, we add the tracking score and the APCE score to support the judgment. Tracking drift is only
considered to have occurred when all three conditions meet the threshold, which is then
transmitted to the condition recognizer. This auxiliary judgment prevents the misjudgment
of the tracker when the tracking condition is good and effectively improves the recognition
accuracy of tracking drift.
Figure 7 displays the SCR variation curve during target tracking. The blue curve
and bounding box represent the baseline algorithm, while the red curve and bounding
box represent our method. Tracking drift occurs, and the size of the target bounding box
gradually increases after the target is occluded starting from frame 1837. The blue curve
and target bounding box indicate that the baseline algorithm has entered a tracking drift
state. At frame 2148, when the SCR exceeds the threshold, it prompts our method to utilize
the block prediction to relocate the target. As a result, the target is successfully tracked at
frame 2343. This demonstrates the significance of the SCR metric in determining tracking
drift and the effectiveness of our method in resolving such situations.
Tracking score: The tracking score is a direct measure of the quality of tracking results.
In Siamese networks, the tracking score is calculated by multiplying the classification confi-
dence and center-ness score. The classification confidence reflects the similarity between
the template and the search region, while the center-ness reflects the distance between the
target and the center point in the search image. The classification confidence scls can be
multiplied by the center-ness sctr to suppress the score for positively classified targets that
are far from the target center. The tracking score strc can be expressed as: strc = scls × sctr .
To suppress large variations in the target scale, the scale penalty function is used to penalize
large-scale variations in the target, and the process can be written as:
s∗ = strc × pn = strc × e^(k × max(r/r′, r′/r) × max(s/s′, s′/s))    (12)
where k is a hyperparameter, r represents the proposal's ratio of height to width, and r′ represents that of the last frame. s and s′ represent the overall scale of the proposal and the last frame, respectively, where s is calculated by:
(w + p) × (h + p) = s²    (13)
where w and h denote the width and height of the target, respectively, and p is the padding
value. Additionally, to suppress scores far from the center, the response map is post-processed with a cosine window function to obtain the final tracking score TS.
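A rough sketch of the penalized score of Equations (12) and (13) for a single proposal (the padding convention, the sign of k, and the numbers are assumptions for illustration):

import math

def overall_scale(w: float, h: float) -> float:
    """s from Equation (13): (w + p)(h + p) = s^2, assuming padding p = (w + h) / 2."""
    p = (w + h) / 2.0
    return math.sqrt((w + p) * (h + p))

def penalized_score(s_trc, w, h, w_prev, h_prev, k=-0.04):
    """Equation (12): s* = s_trc * exp(k * ratio_change * scale_change).

    With a negative k (assumed here), large aspect-ratio or scale changes lower the score.
    """
    r, r_prev = h / w, h_prev / w_prev                      # height/width ratios
    s, s_prev = overall_scale(w, h), overall_scale(w_prev, h_prev)
    change = max(r / r_prev, r_prev / r) * max(s / s_prev, s_prev / s)
    return s_trc * math.exp(k * change)

print(penalized_score(0.8, w=40, h=60, w_prev=42, h_prev=58))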
10  Based on the received APCE feedback, calculate the number T of templates for the (i + 1)-th frame using Equation (1);
11  if T == 1 then
12      Tcon = T1;
13  end
14  else
15      Extract high-quality and diverse templates from the feature memory using Equations (2)–(4);
16  end
17  end
occlusion. This is particularly problematic for long-term tracking, as tracking drift can lead
to complete failure of subsequent tracking. Therefore, it is crucial to introduce a target
prediction process to address tracking drift caused by target occlusion. In airborne tracking,
due to the characteristics of distant sensors and small targets, the motion of the target before it is occluded can be regarded as approximately linear.
The Kalman filter [6] is used in the target prediction stage. The state and observation
equations of the Kalman filter can be expressed as:
xk = A xk−1 + wk−1    (15)
zk = H xk + vk    (16)
where k denotes the moment of the kth frame, xk is the state vector, and zk is the observation
vector. H is the observation matrix, and H = I, where I is the unit matrix. A is the state
transition matrix. wk−1 and vk are the process error and observation error, respectively, and
are assumed to be subject to Gaussian distributions with covariance matrices Q and R. We
set Q = I × 0.1 and R = I. During the process of object tracking, the updating process
takes up most of the time, while the prediction process takes up very little time. Therefore,
it can be assumed that the size of the target will not change significantly in a short period.
The state space of the target is set as follows:
X = [ x, y, w, h, v x , vy ] T (17)
where x, y denote the center coordinates of the target, and w and h denote the width and
height of the target, respectively. vx and vy denote the rates of change of the center coordinates of the target, respectively.
The Kalman filtering computational steps consist of the prediction process and the
update process. The prediction process focuses on predicting the state and error covariance
variables. It can be expressed as:
x̂k− = A x̂k−1    (18)
Pk− = A Pk−1 A^T + Q    (19)
where Pk−1 is the error covariance of the prediction for the (k − 1)-th frame. The state update phase includes the optimal estimation of the system state and the update of the error covariance matrix. It can be expressed as:
x̂k = x̂k− + Kk (zk − H x̂k−)    (20)
Pk = (I − Kk H) Pk−    (21)
Kk = Pk− H^T (H Pk− H^T + R)^(−1)    (22)
where Kk is the Kalman filter gain of the kth frame. zk is the actual observation of the kth frame.
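A compact sketch of this constant-velocity Kalman predictor over the state of Equation (17) (NumPy; since H = I, the observation here includes velocity entries, which in practice would be estimated from frame-to-frame differences, an assumption of this sketch):

import numpy as np

class BoxKalman:
    """Constant-velocity Kalman filter over x = [cx, cy, w, h, vx, vy]^T."""
    def __init__(self, cx, cy, w, h):
        self.x = np.array([cx, cy, w, h, 0.0, 0.0])
        self.P = np.eye(6)
        self.A = np.eye(6)
        self.A[0, 4] = self.A[1, 5] = 1.0             # cx += vx, cy += vy per frame
        self.H = np.eye(6)                            # observe the full state (H = I)
        self.Q = np.eye(6) * 0.1                      # process noise, Q = 0.1 I (as in the text)
        self.R = np.eye(6)                            # observation noise, R = I (as in the text)

    def predict(self):
        self.x = self.A @ self.x                               # Equation (18)
        self.P = self.A @ self.P @ self.A.T + self.Q           # Equation (19)
        return self.x[:4]                                      # predicted (cx, cy, w, h)

    def update(self, z):
        K = self.P @ self.H.T @ np.linalg.inv(self.H @ self.P @ self.H.T + self.R)  # Eq. (22)
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)        # Eq. (20)
        self.P = (np.eye(6) - K @ self.H) @ self.P                                   # Eq. (21)

kf = BoxKalman(100, 80, 30, 20)
kf.update([102, 81, 30, 20, 2, 1])   # observation with assumed velocity estimates
print(kf.predict())                  # during occlusion, keep calling predict()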
the green rectangular box indicates the original search area. The target is not within the
search area due to the small size of the search area. Therefore, in the block search module,
the search area will be expanded first. The process can be expressed as follows:
sw = sqrt(tw^(i−1) / bw) × 289 × n
sh = sqrt(th^(i−1) / bh) × 289 × n    (23)
where sw and sh represent the width and height of the search area, respectively. n is the
magnification factor. In block search, the value of n is 3 when tracking drift occurs because
the object moves out of view, and 2 when tracking drift occurs in other situations. bw and
bh are the width and height of the initially sampled image, obtained as follows:
bw = tw^1 / ts,   bh = th^1 / ts    (24)
ts = sqrt(tw^1 × th^1 × p²) / 289    (25)
where t1w and t1h are the width and height of the initial target, respectively. p is the search
region factor, typically set to 4. Because the enlarged search region is larger than the target, accurately localizing the target is difficult if it is searched for directly within the whole region; the response map of such a direct search, shown in Figure ➂, illustrates this difficulty. Therefore, as shown in
Figure ➃, we segment the enlarged search area into 3 × 3 blocks and search for the target in
each block. Block search effectively overcomes the challenge of accurately localizing small
targets in a larger search area. Figure ➄ shows the response map result of the block search,
which accurately localizes the target in the block image containing it. The target region has
the highest response value score, while the other blocks have significantly lower scores.
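A simplified sketch of the 3 × 3 block search; score_block is a stand-in for running the tracker's matching network on one block and is purely a placeholder:

import numpy as np

def block_search(search_region: np.ndarray, score_block, grid: int = 3):
    """Split an enlarged search region into grid x grid blocks and pick the best response.

    score_block(block) must return a 2-D response map for one block; here it stands in
    for the Siamese matching step applied to that block.
    """
    h, w = search_region.shape[:2]
    bh, bw = h // grid, w // grid
    best = (-np.inf, None, None)                      # (score, (row, col), in-block peak)
    for row in range(grid):
        for col in range(grid):
            block = search_region[row * bh:(row + 1) * bh, col * bw:(col + 1) * bw]
            resp = score_block(block)
            peak = np.unravel_index(np.argmax(resp), resp.shape)
            if resp[peak] > best[0]:
                best = (float(resp[peak]), (row, col), peak)
    return best                                       # highest-scoring block and its peak

# Toy usage: the "network" here simply returns the block itself as a response map.
region = np.random.default_rng(1).random((300, 300))
print(block_search(region, lambda b: b))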
[Figure 8: ➀ Target Prediction, ➁ Expand Image, ➂ Expand Response Map, ➃ Block Image, ➄ Block Response Map, ➅ Target Location.]
Figure ➅ illustrates the calculation of the target center position. The center coordinates
( x, y) of the target in the original search image are calculated as follows:
x = (cw − ⌊√C / 2⌋) × S + x′ − x1 + x0
y = (ch − ⌊√C / 2⌋) × S + y′ − y1 + y0    (26)
where C represents the number of block subimages. (cw , ch ) represents the coordinate
position of the block subimage where the target is located, with the horizontal and vertical
axes being represented by w and h respectively. S denotes the width and height of the
square block subimage. (x0, y0) represents the coordinates of the center position of the search area, x0 = sw/2, y0 = sh/2. (x1, y1) represents the coordinates of the center position of the block subimage where the maximum score of the target is located, x1 = y1 = S/2. (x′, y′)
represents the coordinates of the position with the maximum score in the block search
response map. The width w and height h of the target bounding box are calculated using
the following equations:
w = (1 − r) × wi−1 + r × wp
h = (1 − r) × hi−1 + r × hp    (27)
where w p and h p denote the width and height of the predicted target, respectively. wi−1
and hi−1 denote the width and height of the target in the previous frame, respectively. r is
derived by:
r = pn ( argmax {strc }) × max {strc } × q (28)
where q is a hyperparameter, strc is the tracking score, and pn represents the scale penalty
function.
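A small sketch combining Equations (26) and (27): mapping the winning block's peak back to the original search image and smoothing the box size (here r is a fixed illustrative value rather than the score-derived one of Equation (28)):

import math

def recover_center(cw, ch, C, S, peak_xy, region_center):
    """Equation (26): block index plus in-block peak -> target center in the search image."""
    x_p, y_p = peak_xy                    # (x', y'): peak position inside the winning block
    x0, y0 = region_center                # (x0, y0): center of the expanded search area
    half_grid = int(math.isqrt(C)) // 2   # floor(sqrt(C) / 2), e.g. 1 for a 3 x 3 grid
    x1 = y1 = S / 2.0                     # center of a square block subimage
    x = (cw - half_grid) * S + x_p - x1 + x0
    y = (ch - half_grid) * S + y_p - y1 + y0
    return x, y

def smooth_size(w_prev, h_prev, w_pred, h_pred, r=0.3):
    """Equation (27): blend the previous and predicted box sizes with learning rate r."""
    return (1 - r) * w_prev + r * w_pred, (1 - r) * h_prev + r * h_pred

print(recover_center(cw=2, ch=0, C=9, S=100, peak_xy=(55, 40), region_center=(150, 150)))
print(smooth_size(40, 60, 52, 66))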
Tracking drift can also be caused by targets moving out of view, which makes it difficult
for the tracker to localize the target due to the uncertainty in the center of the search area.
To address this issue, the enlarged block search method is utilized. To bring the search area
closer to the target’s reappearance, we calculate the average of the target’s historical trajectory
coordinates and use it as the center of the new search area. The search area is expanded to
three times the original size, and a 5 × 5 block search is used to search the target.
When the target moves out of the field of view, the tracker is unable to track the target
correctly. During this time, the tracker will perform expanded block search every 20 frames.
(It should be noted that the Kalman filter is not effective in acquiring the target when it
moves out of view or reappears. In this case, only block search will be employed, and
the Kalman filter will not be used.) It is not until the target reappears within the field of
view (e.g., the target changes its direction of motion or the UAV turns its camera towards
the target) that the expanded block search can correctly localize the target. Given the
uncertainty surrounding the moment of target reappearance, the expanded block search is
conducted every 20 frames, which avoids repeated search calculations and improves the
tracking efficiency.
Figure 9 shows the overlap rate graphs of the proposed method compared to the
baseline algorithm in the UAV20L long-term tracking dataset. The overlap ratio is defined
as the intersection-over-union ratio between the predicted target bounding box and the
ground truth. A low overlap ratio in the graph indicates tracking drift, which may be
caused by occlusion or the disappearance of the target. Tracking drift leads to an almost
complete failure of subsequent tracking, as shown by the blue curve in the figure. Our
method (red curve) effectively re-tracks the target after tracking drift, demonstrating its
effectiveness in solving the tracking drift problem.
Figure 9. Comparison results of overlap rates on the UAV20L long-term tracking dataset. The
red curve represents the method using motion prediction and block search, while the blue curve
represents the baseline method. The intermittent blank areas in the figure indicate cases where the
target disappears, resulting in no overlap rate values.
4. Experiments
This section presents the experimental validation of the MPBTrack algorithm. Section 4.1
describes the details of model training and experimental evaluation. In Section 4.2, we per-
form ablation experiments for analysis. Sections 4.3–4.7 present quantitative experimental
results, and Section 4.8 presents a qualitative experimental analysis.
experimental analysis determined that the frame interval suitable for tracking in UAVs
is 20.
Table 1. Analysis of ablation experiments using DTUN and BPM on the UAV123 and UAV20L
datasets.
Module                  UAV20L              UAV123
                        Success  Precision  Success  Precision
Baseline                0.589    0.742      0.647    0.825
Baseline + DTUN         0.617    0.784      0.651    0.834
Baseline + BPM          0.683    0.875      0.651    0.841
Baseline + DTUN + BPM   0.706    0.905      0.656    0.842
Table 2. Experimental results for different block-search interval frames on the UAV20L and VisDrone2018-SOT datasets.
Interval Frames         UAV20L                        VisDrone2018-SOT
                        Success  Precision  FPS       Success  Precision  FPS
10                      0.697    0.893      32.4      0.637    0.822      27.9
15                      0.700    0.898      32.6      0.637    0.823      28.3
20                      0.706    0.905      43.5      0.665    0.864      43.4
25                      0.689    0.811      32.8      0.633    0.818      26.7
Section 3.3.2 discusses the importance of the SCR metric in determining tracking
drift conditions in UAV video object tracking. Choosing a suitable SCR threshold can
improve the accuracy of recognizing tracking conditions. Table 3 demonstrates the impact
of different SCR values on tracking performance. The table shows that choosing a threshold
that is too large or too small results in misjudging the tracking drift condition and further
leads to the degradation of tracking performance.
Table 3. Experimental results for various SCR thresholds on the UAV20L dataset.
In terms of success, the scores are ARC (0.683), BC (0.613), CM (0.698), FO (0.599), IV (0.663),
OV (0.703), PO (0.689), SV (0.706), and VC (0.742); in terms of precision, the scores are ARC
(0.881), BC (0.870), CM (0.900), FO (0.857), IV (0.856), OV (0.898), PO (0.894), SV (0.900),
and VC (0.906). The attribute evaluations demonstrate the remarkable performance of
the proposed DTUN and BPM in dealing with various complex situations, proving the
effectiveness of our approach.
[Figure 10: overall success and precision plots on the UAV20L dataset; MPBTrack ranks first with a success rate of 0.706 and a precision of 0.905.]
Figure 11. Success plot for attribute evaluation on the UAV20L dataset.
Figure 12. Precision plot for attribute evaluation on the UAV20L dataset.
Table 4 presents the results of the comparison experiments between our method and other competitors, such
as TaMOs [38], HiFT [25], STMTrack [9], SiamPW-RBO [43], and LightTrack [44]. Compared
with the baseline method, our method delivers competitive performance, improving success
and precision by 1.4% and 2.1%, respectively.
Table 4. Experimental results comparing our method with other methods on the UAV123 dataset.
Trackers are ranked based on their success scores.
Tracker             Succ.   Prec.
TaMOs [38]          0.571   0.791
HiFT [25]           0.589   0.787
TCTrack [27]        0.604   0.800
PACNet [45]         0.620   0.827
SiamCAR [21]        0.623   0.813
LightTrack [44]     0.626   0.809
CNNInMo [46]        0.629   0.818
SiamBAN [42]        0.631   0.833
SiamRN [11]         0.643   -
AutoMatch [47]      0.644   -
SiamPW-RBO [43]     0.645   -
SiamGAT [10]        0.646   0.843
STMTrack [9]        0.647   0.825
MPBTrack (Ours)     0.656   0.842
[Figure: overall success and precision plots on the UAVDT dataset; MPBTrack ranks first with a success rate of 0.673 and a precision of 0.872.]
Figure 14. Success plots for attribute evaluation on the UAVDT dataset.
Figure 18. Qualitative evaluation results. From top to bottom, the sequences are car1, group2,
person14, uav180, uav1, group1, and uav93.
5. Conclusions
This paper proposes a visual target-tracking algorithm based on motion prediction and block search, aiming to solve the problem of tracking drift from the UAV perspective. Specifically, when the tracker experiences tracking drift due to object occlusion or moving out of view, our approach predicts the motion state of the object using a Kalman filter. Then, the proposed block search module efficiently relocates the drifted target. In addition,
to enhance the adaptability of the tracker in changing scenarios, we propose a dynamic
template update network. This network employs the optimal template strategy based on
various tracking conditions to improve the tracker’s robustness. Finally, we introduce three
evaluation metrics: APCE, SCR, and TS. These metrics are used to identify tracking drift
status and provide prior information for object tracking in subsequent frames. Extensive
experiments and comparisons with many competitive algorithms on five aerial benchmarks,
namely, UAV123, UAV20L, UAVDT, DTB70, and VisDrone2018-SOT, have demonstrated
the effectiveness of our approach in resisting tracking drift in a complex UAV viewpoint
environment, and it achieved a real-time speed of 43 FPS.
Author Contributions: L.S. and X.L. conceived of the idea and developed the proposed approaches.
Z.Y. advised the research. D.G. helped edit the paper. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (No.
62271193), the Aeronautical Science Foundation of China (No. 20185142003), Natural Science Foun-
dation of Henan Province, China (No. 222300420433), Science and Technology Innovative Talents
in Universities of Henan Province, China (No. 21HASTIT030), Young Backbone Teachers in Univer-
sities of Henan Province, China (No. 2020GGJS073), and Major Science and Technology Projects of
Longmen Laboratory (No. 231100220200).
Data Availability Statement: Code and data are available upon request from the authors.
Conflicts of Interest: Author Z.Y. was employed by the company Xiaomi Technology Co., Ltd. The
remaining authors declare that the research was conducted in the absence of any commercial or
financial relationships that could be construed as a potential conflict of interest.
References
1. Yeom, S. Thermal Image Tracking for Search and Rescue Missions with a Drone. Drones 2024, 8, 53.
2. Han, Y.; Yu, X.; Luan, H.; Suo, J. Event-Assisted Object Tracking on High-Speed Drones in Harsh Illumination Environment.
Drones 2024, 8, 22.
3. Chen, Q.; Liu, J.; Liu, F.; Xu, F.; Liu, C. Lightweight Spatial-Temporal Contextual Aggregation Siamese Network for Unmanned
Aerial Vehicle Tracking. Drones 2024, 8, 24.
4. Memon, S.A.; Son, H.; Kim, W.G.; Khan, A.M.; Shahzad, M.; Khan, U. Tracking Multiple Unmanned Aerial Vehicles through
Occlusion in Low-Altitude Airspace. Drones 2023, 7, 241.
5. Gao, Y.; Gan, Z.; Chen, M.; Ma, H.; Mao, X. Hybrid Dual-Scale Neural Network Model for Tracking Complex Maneuvering UAVs.
Drones 2023, 8, 3.
6. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960, 82, 35–45.
7. Xie, X.; Xi, J.; Yang, X.; Lu, R.; Xia, W. STFTrack: Spatio-Temporal-Focused Siamese Network for Infrared UAV Tracking. Drones
2023, 7, 296.
8. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In
Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic,
27 September–1 October 2021; pp. 3086–3092.
9. Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. Stmtrack: Template-free visual tracking with space-time memory networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783.
10. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552.
11. Cheng, S.; Zhong, B.; Li, G.; Liu, X.; Tang, Z.; Li, X.; Wang, J. Learning to filter: Siamese relation network for robust tracking. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021;
pp. 4421–4431.
12. Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. Dropmae: Masked autoencoders with spatial-attention dropout for tracking
tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24
June 2023; pp. 14561–14571.
13. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf.
Process. Syst. 2022, 35, 16743–16754.
14. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of
the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022;
pp. 146–164.
15. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the Computer Vision–ECCV
2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany,
2016; pp. 445–461.
16. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark:
Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14
September 2018; pp. 370–386.
17. Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of
the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
18. Wen, L.; Zhu, P.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Liu, C.; Cheng, H.; Liu, X.; Ma, W.; et al. Visdrone-sot2018: The vision
meets drone single-object tracking challenge results. In Proceedings of the European Conference on Computer Vision (ECCV)
Workshops, Munich, Germany, 8–14 September 2018.
19. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28, 1–9.
21. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual
tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA,
18–24 June 2020; pp. 6269–6277.
22. Paul, M.; Danelljan, M.; Mayer, C.; Van Gool, L. Robust visual tracking by segmentation. In Proceedings of the European
Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 571–588.
23. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of
the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
24. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10448–10457.
25. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. Hift: Hierarchical feature transformer for aerial tracking. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15457–15466.
26. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In
Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg,
Germany, 2022; pp. 341–357.
27. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Tctrack: Temporal contexts for aerial tracking. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808.
28. Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. All-day object tracking for unmanned aerial vehicle. IEEE Trans. Mob. Comput. 2022, 22,
4515–4529.
29. Yang, J.; Gao, S.; Li, Z.; Zheng, F.; Leonardis, A. Resource-efficient RGBD aerial tracking. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13374–13383.
30. Luo, Y.; Guo, X.; Dong, M.; Yu, J. RGB-T Tracking Based on Mixed Attention. arXiv 2023, arXiv:2304.04264.
31. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June
2015; pp. 1–9.
32. Wang, M.; Liu, Y.; Huang, Z. Large margin object tracking with circulant feature maps. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4021–4029.
33. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking
in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 300–317.
34. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale
single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach,
CA, USA, 15–20 June 2019; pp. 5374–5383.
35. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans.
Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
36. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al.
Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September
2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
38. Mayer, C.; Danelljan, M.; Yang, M.H.; Ferrari, V.; Van Gool, L.; Kuznetsova, A. Beyond SOT: Tracking Multiple Generic Objects at
Once. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January
2024; pp. 6826–6836.
39. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706.
40. Kim, M.; Lee, S.; Ok, J.; Han, B.; Cho, M. Towards sequence-level training for visual tracking. In Proceedings of the European
Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 534–551.
41. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
42. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6668–6677.
43. Tang, F.; Ling, Q. Ranking-based Siamese visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8741–8750.
44. Yan, B.; Peng, H.; Wu, K.; Wang, D.; Fu, J.; Lu, H. Lighttrack: Finding lightweight neural networks for object tracking via one-shot
architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN,
USA, 20–25 June 2021; pp. 15180–15189.
45. Zhang, D.; Zheng, Z.; Jia, R.; Li, M. Visual tracking via hierarchical deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3315–3323.
46. Guo, M.; Zhang, Z.; Fan, H.; Jing, L.; Lyu, Y.; Li, B.; Hu, W. Learning target-aware representation for visual tracking via
informative interactions. arXiv 2022, arXiv:2201.02526.
47. Zhang, Z.; Liu, Y.; Wang, X.; Li, B.; Hu, W. Learn to match: Automatic matching network design for visual tracking. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021;
pp. 13339–13348.
48. Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9589–9600.
49. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Siamese anchor proposal network for high-speed aerial tracking. In Proceedings of the 2021
IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 510–516.
50. Mayer, C.; Danelljan, M.; Bhat, G.; Paul, M.; Paudel, D.P.; Yu, F.; Van Gool, L. Transforming model prediction for tracking. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022;
pp. 8731–8740.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.