DOI: 10.1145/3663976.3663996

TandemFuse: An Intra- and Inter-Modal Fusion Strategy for RGB-T Tracking

Published: 27 June 2024

Abstract

Visual object tracking is a prominent task in computer vision, with significant potential in autonomous driving, human-computer interaction, and intelligent surveillance. Many studies have focused on tracking with single-modality data. Among these, the RGB modality is renowned for its rich color and detail capture, yet it is susceptible to motion blur, occlusion, and low-light conditions. Conversely, the TIR (thermal infrared) modality can overcome these issues but is limited by lower resolution and higher cost, making target recognition in complex scenes more challenging. With the rise of multi-modal learning, integrating the RGB and TIR modalities can significantly enhance the robustness of single-modality trackers across various scenarios. This paper proposes a comprehensive multi-modal learning strategy for fusing the RGB and TIR modalities in the tracking task. Two key modules are designed: intra-modal data fusion and inter-modal data fusion. For intra-modal data fusion, we use feature pyramid techniques to merge multi-scale representations within each modality. The features learned from both modalities are then sent to inter-modal data fusion for enhanced tracking. Our model is trained on the RGBT234 dataset and tested on the GTOT dataset, achieving a success rate of 0.454 and a precision rate of 0.438. The insights and methodologies derived from this research offer guidance for future multi-modal object tracking studies and underscore the critical role of intra-modal fusion in enhancing the efficiency of multi-modal integration.
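
The two-stage fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's actual model: the function names, the nearest-neighbor upsampling, and the weighted-sum inter-modal step are all assumptions standing in for the paper's feature-pyramid and fusion operators.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def intra_modal_pyramid_fuse(feats):
    """Top-down feature-pyramid fusion within one modality.

    feats: list of (C, H, W) maps ordered fine -> coarse, each level
    half the spatial size of the previous one. Coarser levels are
    upsampled and summed into the finest-resolution map.
    """
    fused = feats[-1]
    for f in reversed(feats[:-1]):
        fused = f + upsample2x(fused)
    return fused

def inter_modal_fuse(rgb_feat, tir_feat, alpha=0.5):
    """Illustrative inter-modal step: a weighted sum of the two
    modality features (the paper's fusion module is not specified here)."""
    return alpha * rgb_feat + (1.0 - alpha) * tir_feat

# Toy example: 3-level pyramids for RGB and TIR, 8 channels each.
rng = np.random.default_rng(0)
rgb_pyr = [rng.standard_normal((8, 32 // 2**i, 32 // 2**i)) for i in range(3)]
tir_pyr = [rng.standard_normal((8, 32 // 2**i, 32 // 2**i)) for i in range(3)]

rgb_fused = intra_modal_pyramid_fuse(rgb_pyr)   # intra-modal stage, RGB
tir_fused = intra_modal_pyramid_fuse(tir_pyr)   # intra-modal stage, TIR
joint = inter_modal_fuse(rgb_fused, tir_fused)  # inter-modal stage
```

The key structural point the sketch captures is the tandem ordering: multi-scale features are first consolidated within each modality, and only the consolidated representations are exchanged across modalities.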


    Published In

    CVIPPR '24: Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition
    April 2024
    373 pages
    ISBN:9798400716607
    DOI:10.1145/3663976

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Multi-Modal Fusion
    2. RGB-T Tracking
    3. Visual Object Tracking


    Acceptance Rates

    Overall acceptance rate: 14 of 38 submissions (37%)
