DOI: 10.1145/3663976.3664237

SwapTrack: Enhancing RGB-T Tracking via Learning from Paired and Single-Modal Data

Published: 27 June 2024

Abstract

RGB-T tracking leverages complementary information from the RGB and thermal modalities, improving tracking robustness and accuracy under challenging visual conditions. Most existing RGB-T trackers rely on paired RGB-T data for training, yet paired RGB-T images are scarce compared to the abundance of single-modal RGB or TIR images. To fully exploit all available data, we propose SwapTrack, an RGB-T tracker that learns effectively from both paired RGB-T data and single-modal data. The approach incorporates three key designs: shared and separated networks that extract modality-shared and modality-specific patterns, a swapped projection network for cross-modal feature conversion and complementation, and a two-step training scheme for learning from single-modal and multi-modal data. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches when trained on RGB-T+RGB+TIR datasets. Ablation studies further show notable performance gains when additional single-modal data are introduced, underscoring the effectiveness of incorporating single-modal data into training.
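To make the three designs concrete, the following is a minimal PyTorch-style sketch under stated assumptions: every name here (SwapTrackSketch, rgb_to_tir, two_step_training, the loader and loss arguments) is hypothetical and not taken from the paper's implementation. It shows a shared backbone plus separated per-modality branches, swapped projections that synthesize features for an absent modality, and a two-step schedule over single-modal then paired data.

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- module names are assumptions for exposition,
# not the paper's released code.

class SwapTrackSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Shared network: extracts modality-shared patterns from either input
        # (TIR frames assumed replicated to 3 channels).
        self.shared = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        # Separated networks: capture modality-specific patterns.
        self.rgb_branch = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.tir_branch = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # Swapped projection networks: convert one modality's features into
        # the other's space so a missing modality can be complemented.
        self.rgb_to_tir = nn.Conv2d(dim, dim, kernel_size=1)
        self.tir_to_rgb = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, rgb=None, tir=None):
        f_rgb = self.rgb_branch(self.shared(rgb)) if rgb is not None else None
        f_tir = self.tir_branch(self.shared(tir)) if tir is not None else None
        # Single-modal input: synthesize the absent modality's features.
        if f_rgb is None:
            f_rgb = self.tir_to_rgb(f_tir)
        if f_tir is None:
            f_tir = self.rgb_to_tir(f_rgb)
        return torch.cat([f_rgb, f_tir], dim=1)  # fused RGB-T representation


def two_step_training(model, single_loader, paired_loader, loss_fn, opt):
    # Step 1: learn from abundant single-modal RGB or TIR data; the swapped
    # projections fill in features for whichever modality is absent.
    for batch in single_loader:  # each batch carries "rgb" OR "tir"
        feats = model(rgb=batch.get("rgb"), tir=batch.get("tir"))
        loss_fn(feats, batch["target"]).backward()
        opt.step()
        opt.zero_grad()
    # Step 2: fine-tune on scarcer paired RGB-T data so both branches and
    # the projections stay consistent with real cross-modal pairs.
    for batch in paired_loader:
        feats = model(rgb=batch["rgb"], tir=batch["tir"])
        loss_fn(feats, batch["target"]).backward()
        opt.step()
        opt.zero_grad()
```

Because the swapped projections let a single-modal clip yield a full RGB-T representation, both training steps can, under these assumptions, share the same tracking loss.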


    Published In

    CVIPPR '24: Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition
    April 2024
    373 pages
ISBN: 9798400716607
DOI: 10.1145/3663976

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Deep learning
    2. Multimodal feature fusion
    3. RGB-T tracking
    4. Single object tracking

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Major Program of National Natural Science Foundation of China
    • Science Strength Promotion Program of UESTC

    Conference

    CVIPPR 2024

    Acceptance Rates

Overall acceptance rate: 14 of 38 submissions (37%)
