DOI: 10.1145/3664647.3680971
Research Article

TeRF: Text-driven and Region-aware Flexible Visible and Infrared Image Fusion

Published: 28 October 2024

Abstract

The fusion of visible and infrared images aims to produce high-quality fused images with rich textures and salient target information. Existing methods lack interactivity and flexibility during fusion: users cannot express requirements to modify the fusion effect, and different regions of the source images are treated identically by a single fusion model, which leads to homogenized results with little distinction between regions. Moreover, pre-defined fusion strategies produce monotonous effects and fail to adequately account for data credibility, scene illumination, and noise degradation inherent in the source images. To address these issues, we propose Text-driven and Region-aware Flexible visible and infrared image fusion, termed TeRF. On the one hand, we propose a flexible image fusion framework that couples multiple large language and vision models to enable visual-text interaction. On the other hand, we aggregate fine-tuning paradigms for different fusion requirements into a unified fine-tuning pipeline, which allows regions and effects to be selected linguistically and yields visually appealing fusion results. Extensive experiments demonstrate that our method is competitive with existing state-of-the-art methods both qualitatively and quantitatively. Our code is publicly available at https://github.com/Baixuzx7/TeRF.
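
To make the pipeline described above concrete, the sketch below illustrates the general idea of text-driven, region-aware fusion: a textual request names a region and a desired effect, a grounding step converts the region phrase into a mask, and a per-region rule fuses the visible and infrared pixels under that mask. This is a minimal illustration under stated assumptions, not the authors' implementation; the `ground_region`, `fuse_region`, and `text_driven_fusion` helpers are hypothetical placeholders for the large language/vision models and fine-tuned fusion networks that TeRF actually composes.

```python
import numpy as np

def ground_region(prompt: str, vis: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for text-driven region grounding
    (e.g., an open-set detector plus a segmentation model).
    Returns a binary mask; here it trivially selects the right half."""
    h, w = vis.shape[:2]
    mask = np.zeros((h, w), dtype=np.float32)
    mask[:, w // 2:] = 1.0  # placeholder region, not a real grounding result
    return mask

def fuse_region(vis: np.ndarray, ir: np.ndarray, effect: str) -> np.ndarray:
    """Hypothetical per-region fusion rule selected by the textual effect.
    TeRF fine-tunes a fusion network per requirement; simple pixel rules
    are used here purely for illustration."""
    if effect == "highlight thermal targets":
        return np.maximum(vis, ir)      # keep salient infrared intensities
    if effect == "preserve visible texture":
        return 0.8 * vis + 0.2 * ir     # favour visible detail
    return 0.5 * (vis + ir)             # default average fusion

def text_driven_fusion(vis, ir, requests):
    """Apply different fusion effects to different text-selected regions."""
    fused = 0.5 * (vis + ir)            # global baseline fusion
    for region_prompt, effect in requests:
        mask = ground_region(region_prompt, vis)[..., None]
        fused = mask * fuse_region(vis, ir, effect) + (1.0 - mask) * fused
    return np.clip(fused, 0.0, 1.0)

if __name__ == "__main__":
    vis = np.random.rand(256, 256, 3).astype(np.float32)  # visible image
    ir = np.random.rand(256, 256, 3).astype(np.float32)   # infrared image
    out = text_driven_fusion(
        vis, ir,
        [("the pedestrian", "highlight thermal targets"),
         ("the background", "preserve visible texture")])
    print(out.shape, out.min(), out.max())
```
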

Supplemental Material

MP4 File - TeRF: Text-driven and Region-aware Flexible Visible and Infrared Image Fusion
Video Presentation of TeRF

Cited By

  • TextFusion: Unveiling the power of textual semantics for controllable image fusion. Information Fusion (Nov. 2024). DOI: 10.1016/j.inffus.2024.102790
  • MSL-CCRN: Multi-stage Self-supervised Learning Based Cross-modality Contrastive Representation Network for Infrared and Visible Image Fusion. Digital Signal Processing (Nov. 2024). DOI: 10.1016/j.dsp.2024.104853


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2024

    Author Tags

    1. fine-tuning
    2. image fusion
    3. large models
    4. text-driven

    Qualifiers

    • Research-article

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne, VIC, Australia

    Acceptance Rates

    MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
    Overall acceptance rate: 2,145 of 8,556 submissions (25%)
