1. Introduction
As one of the most fundamental tasks in computer vision, visual object tracking (VOT) aims to locate an object that is initialized by a bounding box in the first frame of a video sequence. It has been widely used in many applications such as video surveillance, robotics, and autonomous driving. From a model-based perspective, tracking algorithms have evolved from classical correlation-filter-based models to deep neural networks, owing to the latter's powerful feature representation [1,2,3,4,5,6]. In the last few years, transformer-based trackers have shown improved performance due to attention mechanisms that enable the modeling of complex feature interactions [7,8,9]. However, existing single-model trackers do not perform as well in practice as they do on publicly available test datasets, especially in challenging scenarios such as those shown in Figure 1, where poor feature representations and model drift often lead to tracking failures. Although many attempts have been made to optimize the tracking paradigm for better accuracy and robustness [10,11], trackers based only on RGB images are approaching their performance upper bound. Another viable approach to this problem is incorporating natural language descriptions as an auxiliary modality to enrich the features used in the tracking process [12,13,14,15,16,17]. Unlike visual features, which are easily influenced by appearance changes, natural language features, describing the target with color, motion, position, and category information, are less sensitive to variations in appearance [18,19,20,21].
Vision–language tracking is a relatively new research topic [12,13,14,15,16,17], first proposed by Li [22]. Previous researchers have explored fusion methods for combining linguistic and visual features. These approaches include the simple concatenation of visual and linguistic features, employing self-attention within each modality to enhance their interaction, or using cross-attention to promote interactions between the two modalities [12,13,14,15,17]. Specifically, linguistic features are typically embedded using natural language processing (NLP) models [12] and then concatenated with visual features extracted from CNN models. However, simple concatenation may ignore the relationship between visual and linguistic features during the fusion process: the fused features contain both modalities but lack the semantic similarity relation that exists between them, resulting in a less informative feature representation. In fact, several challenges within the vision–language tracking field remain unexplored. This study aims to establish a coherent connection between visual and linguistic features without diminishing the inherent strengths of each individual modality.
We are inspired by the intrinsic reflexes of the human brain. Given an image annotated with a natural language description, a human brain can quickly attend to the nouns inferred by the adjectives and adverbs [23]. This inspiration led us to seek a novel fused feature representation that assigns higher weights to regions where visual and linguistic features exhibit semantic similarity, while suppressing irrelevant regions to enhance the distinguishability of the target.
A novel feature fusion method, termed multimodal features alignment (MFA), is proposed in this paper to merge visual and linguistic features, aiming to minimize cross-modality differences and enhance the target's feature representation in tracking tasks. In our proposed method, the extracted visual feature, X, and the embedded linguistic feature, Q, are fed into a bilinear pooling (BP) [24] model to obtain a joint expression of the multimodal features. Given that a single sentence annotates a whole set of video frames in public tracking datasets, and considering the density of visual features and the sparsity of linguistic features, we integrate a factorization machine within BP to regulate the sparse weight matrices of these two features [18,19]. To mitigate cross-modality discrepancies, we employ soft attention twice during the feature fusion phase. The first soft attention ascertains the weight of each word in the natural language description, while the second generates the distribution of visual–spatial grid weights. The linguistic weight map is combined with the original linguistic feature to derive the linguistic attentional feature, which then acts as a "bridge" and is fused with the visual feature through a factorized bilinear pooling model, followed by additional soft attention layers that generate the distribution of visual–spatial grid weights. In the final steps, the visual weight map is combined with the original visual feature to produce the visual attentional feature, and the two attentional features are input into a second factorized bilinear pooling step to generate the fused feature. Our approach employs a Siamese-based network as the backbone, similar to many tracking models, with tracking inference involving a classification head and a regression head [25,26]. In most tracking-by-detection models, the effectiveness of the classification and the positive/negative ratio in the dataset significantly impact the classification and regression outcomes, potentially leading to ambiguous results. In our case, the fused feature map acts as the search region instead of a resized image, thereby modifying the anchor-based tracking paradigm to predict the target's location.
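To make the fusion pipeline above concrete, the following is a minimal, hypothetical sketch of the data flow; the module names (word_attn, fbp1, grid_attn, fbp2) and tensor shapes are illustrative placeholders rather than the actual implementation.

```python
import torch

def mfa_fuse(X, Q, word_attn, fbp1, grid_attn, fbp2):
    """Hypothetical data flow of the MFA fusion described above.
    X: visual feature map (B, C, H, W); Q: word embeddings (B, L, D)."""
    Q_att = word_attn(Q)        # first soft attention: weight each word -> linguistic attentional feature
    joint = fbp1(X, Q_att)      # factorized bilinear pooling, with Q_att acting as a "bridge"
    grid_w = grid_attn(joint)   # second soft attention: visual-spatial grid weights (B, 1, H, W)
    X_att = grid_w * X          # visual attentional feature
    F = fbp2(X_att, Q_att)      # second factorized bilinear pooling -> fused feature
    return F                    # F replaces the resized search image in the tracker
```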
The main contributions of this paper are as follows:
A novel multimodal features alignment network (MFA) for vision–language visual tracking is proposed; it models the semantic relationship between visual features and annotated language descriptions to generate fused features. Experiments are conducted on three natural-language-annotated datasets, and our tracker exhibits good performance compared with state-of-the-art trackers.
Weighted fusion features are used as input to the tracking procedure instead of traditionally resized search images. The fused feature map is divided into multiple grids with distributed weights. To the best of our knowledge, this is the first study to use fused feature maps instead of traditional search images in a tracking network.
The proposed loss function minimizes the cross-modality discrepancy between visual and natural language features, reflecting the effectiveness of the fusion in the tracking procedure.
2. Related Work
In this section, we present relevant research in the fields of visual object tracking, vision–language fusion modules, and vision–language object tracking, respectively.
2.1. Visual Object Tracking
Correlation-filter-based tracking models, which were most commonly used in the early stages of visual object tracking, aim to design a filter whose response is maximized at the target location [27]. Siamese-based trackers have since received increasing attention from researchers because their deep neural networks have powerful capabilities in feature representation. SiameseFC [3] was the first model of this kind, and SiameseRPN [1] subsequently brought a region proposal network into tracking, spurring the development of tracking-by-detection methods. Several improved versions have since been developed to enhance different aspects. SiameseRPN++ [2] and SiamDW [28] improve feature representation with deeper and wider networks. ATOM [29], DiMP [30], and PrDiMP [13] were proposed to make targets more easily discriminated from their backgrounds. Recently, transformer-based trackers have outperformed CNN-based models on many benchmarks [8,31,32,33,34]. Their advantage lies in their ability to model long-range dependencies and capture complex relations in sequential data. From the perspective of scenario attributes, Dai [35] proposed a meta-updater (LTMU) model that adaptively learns geometric cues, appearance cues, and discriminative cues for long-term tracking. Moreover, many strategic adjustments [10,11] have been made to optimize tracking models for faster and more robust performance. It can be concluded that two-stream-paradigm trackers still play an important role in tracking models, and a representative feature, particularly in challenging cases, is critical for successful tracking.
2.2. Vision–Language Fusion Model
Before being applied to visual tracking, vision–language fusion models were commonly used in audio–visual speech recognition (AVSR) [36], image retrieval [37], and video question-answering tasks [38]. In recent years, transformer-based models have become the preferred architecture for multimodal pretraining due to their excellent capacity for modeling global dependencies [9]. Lu [39] proposed ViLBERT, which feeds linguistic features and visual features into transformer encoders and adopts a co-attention mechanism to fuse the heterogeneous information; however, it ignores the feature distribution within each single modality as well as the interaction between the modalities. Similarly, VisualBERT [40] and Unicoder-VL [41] also implement BERT and transformer networks for feature extraction and integration. Kim [42] encoded images and language through an image encoder and a language encoder, respectively, and then integrated the visual and linguistic features through a multimodal encoder with cross-modal attention. RoBERTa [43] is designed for pretraining natural language processing (NLP) systems, improving on Bidirectional Encoder Representations from Transformers (BERT). However, the visual encoders mentioned above are often limited by annotation information, such as category labels or bounding boxes, for specific tasks, which hinders the generalization performance of the models. Moreover, language embedding models require pretraining on large datasets and must then be fine-tuned for different downstream tasks using smaller datasets. In comparison, the annotated descriptions available for vision–language tracking are relatively few, which may lead to a misalignment between visual and linguistic features. Furthermore, such a paradigm is both time-consuming and computationally expensive due to the huge number of learnable parameters.
In tracking tasks, visual features change between frames, while linguistic features remain unchanged. Methods that reduce the cross-modality discrepancy between visual and linguistic features at the semantic level therefore deserve further exploration.
2.3. Vision–Language Object Tracking
As a new topic in computer vision, vision–language visual tracking has attracted considerable attention from researchers in recent years [7,12,13,14,15,16,22], along with the rapid development of natural language processing. Li [22] was the first to apply the fusion of vision–language features in a tracking task and released OTB-99LANG, the first vision–language tracking dataset. Wang [17] proposed a structure-aware graph convolutional neural network, with graph nodes representing samples and edges modeling the spatiotemporal relationships between samples; however, the heterogeneity of the two modalities was not considered, and the added features were not used in the final candidate proposal. Wang [13] also proposed a vision–language tracker based on an AdaSwitch model to prevent the bounding box from drifting away; visual grounding was introduced into the tracking procedure, the linguistic feature was used to initialize the target instead of a bounding box, and a large natural-language-annotated object tracking dataset, TNL2K, was published. Feng [16] presented a tracking-by-detection route that identifies the most likely image region given the natural language description, and then proposed a Siamese natural language RPN++ tracker named SNLT [14], in which the correlation process is duplicated three times and the correlation results are aggregated dynamically using joint conditional probabilities between the language network and the vision network. Guo [44] proposed a ModaMixer and asymmetrical networks to learn a unified, adaptive vision–language representation. Zhao [7] presented a transformer-based tracking network that uses a proxy token to guide the cross-modal attention; the proxy token modulates the word embeddings and makes them attend to visual features, yet the language token is not attended to by the visual features. JointNLT [45] unifies the cross-modality relation and the cross-temporal relation between natural language and images, and can be applied to both grounding and tracking tasks.
Previous works have illustrated that language descriptions, acting as auxiliary modality information, can improve tracking performance if they are exploited correctly. However, the fusion approach has more possibilities beyond simple concatenation. In this paper, we learn a fused feature representation by aligning target-relevant linguistic and visual features, achieving better results.
3. Methods
In this section, we introduce the details of our new vision–language tracking framework, as shown in Figure 2. Specifically, we first describe the multimodal features alignment module, which generates a vision–language feature representation with factorized bilinear pooling and a co-attention mechanism. Then, the fused features act as search regions and are fed into the Siamese-based network for tracking. The definitions of the variables are listed in Table 1.
3.1. Vision–Language Feature Factorized Bilinear Module
In our proposed method, visual features and linguistic features are first projected into a latent space to obtain a joint representation via factorized bilinear pooling.
Consider the visual feature $X \in \mathbb{R}^{m}$ and the linguistic feature $Q \in \mathbb{R}^{n}$ as pairwise features; a common bilinear transformation can be represented as
$$ f_i = X^{T} W_i Q, \qquad (1) $$
where $W_i \in \mathbb{R}^{m \times n}$ is a projection matrix and $f_i$ is the output of the bilinear model, as shown in Figure 3. The bias term is omitted here since it is implicit in $W$ [24]. Equation (1) can be rewritten as
$$ F = X^{T} W Q. \qquad (2) $$
Specifically, to obtain an $o$-dimensional output $F$, we need to learn $W = [W_1, \ldots, W_o] \in \mathbb{R}^{m \times n \times o}$. Although bilinear pooling is capable of capturing pairwise interactions, it also introduces a quadratic number of parameters in the projection matrix $W$.
Due to the matrix factorization tricks for uni-modal data, the projection matrix $W_i$ in Equation (2) can be factorized into two low-rank matrices, as follows:
$$ f_i = X^{T} U_i V_i^{T} Q = \mathbf{1}^{T} \left( U_i^{T} X \circ V_i^{T} Q \right), \qquad (3) $$
where $U_i \in \mathbb{R}^{m \times k}$ and $V_i \in \mathbb{R}^{n \times k}$, $k$ is the factor, or latent dimensionality, of the factorized matrices $U_i$ and $V_i$, $\mathbf{1} \in \mathbb{R}^{k}$ is an all-one vector, and ∘ denotes the Hadamard product, i.e., the element-wise multiplication of two vectors.
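As an illustration only, Equation (3) can be implemented with two linear projections, an element-wise product, and sum pooling over the factor dimension k, in the spirit of standard factorized bilinear pooling; the layer sizes below are assumed values, not the configuration used in this work.

```python
import torch
import torch.nn as nn

class FactorizedBilinearPooling(nn.Module):
    """Sketch of Equation (3): f_i = 1^T (U_i^T X ∘ V_i^T Q), for i = 1..o."""
    def __init__(self, dim_x, dim_q, dim_out, k):
        super().__init__()
        self.k, self.dim_out = k, dim_out
        self.proj_x = nn.Linear(dim_x, dim_out * k, bias=False)  # stacks U_1..U_o
        self.proj_q = nn.Linear(dim_q, dim_out * k, bias=False)  # stacks V_1..V_o

    def forward(self, x, q):
        joint = self.proj_x(x) * self.proj_q(q)                  # Hadamard product in the k*o space
        joint = joint.view(*joint.shape[:-1], self.dim_out, self.k)
        return joint.sum(dim=-1)                                 # sum-pool over the factor dimension k

# Example: fuse a 512-d visual vector with a 300-d linguistic vector into a 256-d output.
fbp = FactorizedBilinearPooling(dim_x=512, dim_q=300, dim_out=256, k=5)
f = fbp(torch.randn(2, 512), torch.randn(2, 300))  # -> shape (2, 256)
```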
Given a video sequence, a search region is cropped at frame $t$. The visual feature $X$ is extracted from it through the CNN model; at the same time, the natural language description is embedded through GloVe, followed by a two-layer LSTM, to generate the linguistic feature $Q$. Soft attention is then applied to the visual feature and the linguistic feature, respectively, to obtain the probability distribution within each modality. Specifically, for the linguistic feature $Q$, we apply two convolutional layers and the softmax function to predict each word's attention weight; we then take the weighted sum of the linguistic feature to create an attentional map and output the linguistic attentional feature. On the other hand, the visual feature is merged with the linguistic attentional feature using factorized bilinear pooling, as in Equation (3); this is followed by two convolutional layers and softmax normalization to obtain an attention distribution over the search region. Without loss of generality, $U$ and $V$ can be reformulated as 2D matrices with simple reshape operations, so that the attention weights and attentional features can be computed with ordinary matrix multiplications. Here, the linguistic attentional feature serves as a "bridge" between the linguistic feature and the visual feature, pinpointing the highly relevant regions in the search image. The visual weight vector of each grid represents the attention distribution induced by the natural language description. Finally, we take the weighted sum of the visual feature vectors and the visual weights to generate the attended visual feature, conditioned on $Q$. The training algorithm is shown in Algorithm 1.
Algorithm 1: Curriculum training for the tracker.
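A rough sketch of the word-level soft attention described above is given below, assuming two 1 × 1 convolutions followed by a softmax over the words; the hidden width and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Sketch of the linguistic soft attention: two conv layers + softmax over words,
    followed by a weighted sum to form the linguistic attentional feature."""
    def __init__(self, dim_q, hidden=256):  # hidden size is an assumed value
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv1d(dim_q, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, q):                   # q: (B, L, dim_q) word embeddings from GloVe + LSTM
        w = self.score(q.transpose(1, 2))   # (B, 1, L) unnormalized word scores
        w = torch.softmax(w, dim=-1)        # attention weight per word
        return torch.bmm(w, q).squeeze(1)   # (B, dim_q) linguistic attentional feature

q_att = WordAttention(dim_q=300)(torch.randn(2, 12, 300))  # e.g., a 12-word description
```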
3.2. Vision–Language Feature Co-Attention Module
We employ a co-attention method to align the visual feature and the linguistic feature. It comprises two steps: self-attention over the embedded natural language description of the target, and description-conditioned attention over the visual embedding.
The attention mechanism uses an attention probability distribution $\alpha$ over the $G$ grid cells of the search region. $\alpha$ is obtained by applying the softmax function to each row vector of the attention scores computed from the joint feature; the bias terms are omitted for simplicity. The visual attentional feature is then a linear combination of the visual feature vectors, with the entries of $\alpha$ as coefficients. Each attention probability distribution corresponds to one glimpse $g$, and for multiple glimpses the visual attentional feature is the concatenation of the individual glimpse vectors. After that, the linguistic attentional feature and the visual attentional feature are fed into another factorized bilinear pooling module to generate the joint feature representation.
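The glimpse-based grid attention can be sketched as follows; the number of glimpses, layer widths, and feature dimensions are assumptions, and the sketch only illustrates scoring each of the G grid cells per glimpse and concatenating the attended vectors.

```python
import torch
import torch.nn as nn

class GlimpseGridAttention(nn.Module):
    """Sketch of the visual co-attention: per-grid scores for g glimpses, softmax over
    the G grid cells, and concatenation of the attended glimpse vectors."""
    def __init__(self, dim_joint, dim_x, glimpses=2, hidden=256):  # glimpses/hidden assumed
        super().__init__()
        self.glimpses = glimpses
        self.score = nn.Sequential(
            nn.Conv2d(dim_joint, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, glimpses, kernel_size=1),
        )

    def forward(self, joint_map, x_map):
        # joint_map: (B, dim_joint, H, W) fused map; x_map: (B, dim_x, H, W) visual feature
        B, C, H, W = x_map.shape
        alpha = self.score(joint_map).flatten(2)           # (B, g, G) with G = H*W grid cells
        alpha = torch.softmax(alpha, dim=-1)               # one distribution per glimpse
        x = x_map.flatten(2)                               # (B, dim_x, G)
        attended = torch.einsum('bgs,bcs->bgc', alpha, x)  # weighted sums over the grid
        return attended.reshape(B, self.glimpses * C)      # concatenation of glimpse vectors

x_att = GlimpseGridAttention(dim_joint=256, dim_x=512)(torch.randn(1, 256, 16, 16),
                                                       torch.randn(1, 512, 16, 16))
```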
3.3. Prediction Head
In the tracking procedure, we take the Siamese-based two-stream network as the tracking backbone and adopt a grid-wise target localization strategy, as in [46]. The fused feature generated by the MFA module is reshaped by a transposed convolution and then fed into the tracking network as the search region, while the template branch receives the visual template region. The inference algorithm is shown in Algorithm 2. Both branches share parameters through the backbone network, which applies the same transformation to embed the template region and the fused search region into a shared feature space. The resulting response maps serve the classification and regression tasks, and the feature maps are fine-tuned by two convolutional layers before the prediction heads.
Algorithm 2: Inference of the proposed tracker.
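Before the prediction heads, the shared-backbone embeddings of the template region and the fused search region are correlated grid-wise. The sketch below assumes a depthwise cross-correlation operator, which is a common choice in Siamese trackers but is only an assumption here.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, template_feat):
    """Per-channel cross-correlation of the embedded search region with the template."""
    b, c, h, w = search_feat.shape
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search_feat.reshape(1, b * c, h, w), kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

# backbone(template) and backbone(fused search region) share parameters; the response
# map feeds the classification, regression, and quality heads described below.
resp = depthwise_xcorr(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 7, 7))  # (1, 256, 25, 25)
```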
The classification head takes the correlation response map as input. A position $(i, j)$ on the feature map is regarded as a positive sample when its corresponding position on the input image falls within the ground-truth bounding box; here, $q$ is the step size of the backbone. The regression head takes the same response map as input and outputs the offsets used to predict the bounding box position. For each positive sample, the last layer of the regression head predicts the distances to the four sides of the ground-truth box:
$$ l = x - x_0, \quad t = y - y_0, \quad r = x_1 - x, \quad b = y_1 - y, $$
where $(x_0, y_0)$ and $(x_1, y_1)$ represent the top-left and the bottom-right corners of the ground-truth box, respectively, $(x, y)$ is the position on the input image corresponding to $(i, j)$, and $q$ represents the backbone step size.
It is assumed that feature pixels around the center of the target have a better estimation quality than other pixels. A convolution layer is therefore added in parallel with the classification head for quality estimation; its output estimates a prior spatial score that down-weights positions far from the target center.
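The grid-wise label assignment, regression offsets, and quality score described above can be sketched as follows; the centerness-style definition of the prior spatial score and the coordinate mapping are assumptions borrowed from typical anchor-free trackers, not necessarily the exact formulation used here.

```python
import torch

def grid_targets(bbox, feat_size, q):
    """Grid-wise assignment: a cell (i, j) is positive if its image-plane position lies
    inside the ground-truth box; offsets (l, t, r, b) and a centerness-style quality
    score are computed for the regression and quality heads. bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    ii, jj = torch.meshgrid(torch.arange(feat_size), torch.arange(feat_size), indexing='ij')
    x = q // 2 + jj * q            # map column index back to image x using step size q
    y = q // 2 + ii * q            # map row index back to image y
    l, t = x - x0, y - y0
    r, b = x1 - x, y1 - y
    pos = (l > 0) & (t > 0) & (r > 0) & (b > 0)          # positive samples lie inside the box
    ratio = (torch.minimum(l, r) / torch.maximum(l, r).clamp(min=1)) * \
            (torch.minimum(t, b) / torch.maximum(t, b).clamp(min=1))
    quality = torch.sqrt(ratio.clamp(min=0)) * pos       # centerness-style prior spatial score
    return pos, torch.stack([l, t, r, b]), quality

pos, reg, quality = grid_targets((60, 60, 180, 180), feat_size=25, q=8)
```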
3.4. Loss Function
We describe the training objective function following recently proposed anchor-free tracking methods [46]. For the classification sub-task, focal loss is employed.
For quality assessment, BCE loss is selected, since we treat quality estimation as a binary classification problem.
For the regression sub-task, a bounding-box regression loss is employed; it is computed only over positive samples, with the indicator function taking the value 1 if a position is considered a positive sample and 0 otherwise.
Finally, the overall loss is defined as the weighted sum of the classification loss, the quality assessment loss, and the regression loss.
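A minimal sketch of one plausible way to combine the three sub-task losses is shown below; the IoU-style regression loss, focal-loss hyperparameters, and weighting coefficients are assumptions, since the exact choices are not reproduced here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss for the classification sub-task (gamma is an assumed value)."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0, p, 1 - p)
    return (-(1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).mean()

def iou_loss(pred, target, eps=1e-6):
    """IoU loss for (l, t, r, b) offsets predicted at the same grid position."""
    inter_w = torch.minimum(pred[:, 0], target[:, 0]) + torch.minimum(pred[:, 2], target[:, 2])
    inter_h = torch.minimum(pred[:, 1], target[:, 1]) + torch.minimum(pred[:, 3], target[:, 3])
    inter = inter_w.clamp(min=0) * inter_h.clamp(min=0)
    area_p = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    area_t = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    return (1 - inter / (area_p + area_t - inter + eps)).mean()

def overall_loss(cls_logits, cls_targets, q_logits, q_targets, reg_pred, reg_targets,
                 pos, lam1=1.0, lam2=3.0):
    """Weighted sum of classification, quality, and regression losses (lam1/lam2 assumed)."""
    l_cls = focal_loss(cls_logits, cls_targets)
    l_q = F.binary_cross_entropy_with_logits(q_logits[pos], q_targets[pos])
    l_reg = iou_loss(reg_pred[pos], reg_targets[pos])
    return l_cls + lam1 * l_q + lam2 * l_reg
```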
5. Conclusions
We present a novel multimodal features alignment (MFA) approach for vision–language object tracking. Unlike prior attempts, we adopt factorized bilinear pooling and an attention mechanism to align visual and linguistic features by constraining them according to their semantic similarities. The weighted fused feature map is cross-correlated with the template region at the grid level, avoiding the ambiguity introduced by anchor boxes. Experiments on public datasets demonstrate that our Siamese MFA tracker achieves competitive results compared with other state-of-the-art vision–language trackers under standard evaluation metrics. When the appearance changes dramatically (e.g., rotation, aspect ratio change, and fast motion) or the context is complex (e.g., illumination variation and out-of-view scenarios), our proposed tracker attains satisfactory performance even compared with state-of-the-art RGB trackers. This performance shows that our multimodal feature alignment is effective for vision–language tracking. Overall, we expect our method to be useful for enhancing the performance of vision–language fusion trackers. In future work, we will focus on the particular information of visual and linguistic features, especially on modeling the inter-modality and intra-modality relationships.