DOI: 10.1145/3663976.3664237

SwapTrack: Enhancing RGB-T Tracking via Learning from Paired and Single-Modal Data

Published: 27 June 2024

Abstract

RGB-T tracking leverages complementary information from the RGB and thermal modalities, improving tracking robustness and accuracy under challenging visual conditions. Most existing RGB-T trackers rely on paired RGB-T data for training, yet paired RGB-T images are scarce compared to the abundance of single-modal RGB or TIR images. To fully exploit all available data, we propose SwapTrack, an RGB-T tracker that learns effectively from both paired RGB-T data and single-modal data. The approach incorporates three key designs: shared and separated networks that extract modality-shared and modality-specific patterns, a swapped projection network for cross-modal feature conversion and complementation, and a two-step training scheme for learning from single-modal and multi-modal data. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches when trained on RGB-T+RGB+TIR datasets. Ablation studies further show notable performance gains when additional single-modal data are introduced, underscoring the effectiveness of incorporating single-modal data into training.
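To make the three designs concrete, the following is a minimal PyTorch-style sketch under stated assumptions: every name here (SwapTrackSketch, rgb_to_tir, two_step_training, the loader and loss arguments) is hypothetical and not taken from the paper's implementation. It shows a shared backbone plus separated per-modality branches, swapped projections that synthesize features for an absent modality, and a two-step schedule over single-modal then paired data.

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- module names are assumptions for exposition,
# not the paper's released code.

class SwapTrackSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Shared network: extracts modality-shared patterns from either input
        # (TIR frames assumed replicated to 3 channels).
        self.shared = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        # Separated networks: capture modality-specific patterns.
        self.rgb_branch = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.tir_branch = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # Swapped projection networks: convert one modality's features into
        # the other's space so a missing modality can be complemented.
        self.rgb_to_tir = nn.Conv2d(dim, dim, kernel_size=1)
        self.tir_to_rgb = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, rgb=None, tir=None):
        f_rgb = self.rgb_branch(self.shared(rgb)) if rgb is not None else None
        f_tir = self.tir_branch(self.shared(tir)) if tir is not None else None
        # Single-modal input: synthesize the absent modality's features.
        if f_rgb is None:
            f_rgb = self.tir_to_rgb(f_tir)
        if f_tir is None:
            f_tir = self.rgb_to_tir(f_rgb)
        return torch.cat([f_rgb, f_tir], dim=1)  # fused RGB-T representation


def two_step_training(model, single_loader, paired_loader, loss_fn, opt):
    # Step 1: learn from abundant single-modal RGB or TIR data; the swapped
    # projections fill in features for whichever modality is absent.
    for batch in single_loader:  # each batch carries "rgb" OR "tir"
        feats = model(rgb=batch.get("rgb"), tir=batch.get("tir"))
        loss_fn(feats, batch["target"]).backward()
        opt.step()
        opt.zero_grad()
    # Step 2: fine-tune on scarcer paired RGB-T data so both branches and
    # the projections stay consistent with real cross-modal pairs.
    for batch in paired_loader:
        feats = model(rgb=batch["rgb"], tir=batch["tir"])
        loss_fn(feats, batch["target"]).backward()
        opt.step()
        opt.zero_grad()
```

Because the swapped projections let a single-modal clip yield a full RGB-T representation, both training steps can, under these assumptions, share the same tracking loss.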


    Published In

    CVIPPR '24: Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition
    April 2024
    373 pages
ISBN: 9798400716607
DOI: 10.1145/3663976

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Deep learning
    2. Multimodal feature fusion
    3. RGB-T tracking
    4. Single object tracking

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Major Program of National Natural Science Foundation of China
    • Science Strength Promotion Program of UESTC

    Conference

    CVIPPR 2024

    Acceptance Rates

Overall acceptance rate: 14 of 38 submissions (37%)
