Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion

Sun, Hongze; Liu, Rui; Cai, Wuque; Wang, Jun; Wang, Yue; Tang, Huajin; Cui, Yan; Yao, Dezhong; Guo, Daqing

doi:10.1016/j.neunet.2024.106493

arXiv:2405.17903 (cs)

[Submitted on 28 May 2024]

Abstract: Visual object tracking, which relies primarily on visible-light image sequences, faces numerous challenges in complicated scenarios such as low-light conditions, high dynamic range, and background clutter. Incorporating the advantages of multiple visual modalities is a promising way to achieve reliable object tracking under these conditions. However, existing approaches usually integrate multimodal inputs through adaptive local feature interactions, which cannot leverage the full potential of the visual cues and thus result in insufficient feature modeling. In this study, we propose a novel multimodal hybrid tracker (MMHT) that uses frame-event-based data for reliable single object tracking. The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from the different visual modalities, and then uses a unified encoder to align the features across domains. Moreover, we propose an enhanced transformer-based module that fuses the multimodal features using attention mechanisms. With these methods, the MMHT model can effectively construct a multiscale and multidimensional visual feature space and achieve discriminative feature modeling. Extensive experiments demonstrate that the MMHT model performs competitively with other state-of-the-art methods. Overall, our results highlight the effectiveness of the MMHT model in addressing the challenges of visual object tracking.
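The abstract describes the pipeline only at a high level: an SNN branch for event data, an ANN branch for frames, and attention-based fusion. The following is a minimal sketch of those two ingredients, not the authors' implementation. The leaky integrate-and-fire (LIF) neuron model, the hard reset, the `threshold` and `decay` values, the token shapes, and the choice of frame tokens as queries with event tokens as keys and values are all assumptions made for illustration. The unified encoder and the learned projections of a real transformer layer are omitted.

```python
import math
import random


def lif_spike_rates(event_seq, threshold=1.0, decay=0.5):
    """Leaky integrate-and-fire over T time steps; returns per-unit firing rates.

    event_seq: list of T lists, each holding one input current per unit.
    The neuron model and its parameters are assumptions, not from the paper.
    """
    n_units = len(event_seq[0])
    membrane = [0.0] * n_units
    spike_counts = [0] * n_units
    for currents in event_seq:
        for i, current in enumerate(currents):
            membrane[i] = decay * membrane[i] + current  # leak, then integrate
            if membrane[i] >= threshold:
                spike_counts[i] += 1
                membrane[i] = 0.0  # hard reset after a spike
    return [count / len(event_seq) for count in spike_counts]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends over all key tokens."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return fused


random.seed(0)
dim, steps = 4, 8
# Hypothetical inputs: 3 frame tokens standing in for ANN-branch features,
# and 2 event tokens, each a T-step current sequence passed through the LIF layer.
frame_tokens = [[random.random() for _ in range(dim)] for _ in range(3)]
event_tokens = [
    lif_spike_rates([[random.random() for _ in range(dim)] for _ in range(steps)])
    for _ in range(2)
]
# Frame tokens query the event tokens; the output has one fused vector per frame token.
fused_tokens = cross_attention(frame_tokens, event_tokens, event_tokens)
```

Because attention weights are non-negative and sum to one, each fused vector is a convex combination of the event tokens, so this sketch only shows how one modality can be re-weighted by the other. A full fusion module would typically add learned projections, residual connections, and attention in both directions.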

Comments:	16 pages, 7 figures, 9 tables; This work has been submitted for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Cite as:	arXiv:2405.17903 [cs.CV]
	(or arXiv:2405.17903v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.17903
Related DOI:	https://doi.org/10.1016/j.neunet.2024.106493

Submission history

From: Daqing Guo
[v1] Tue, 28 May 2024 07:24:56 UTC (2,054 KB)
