Scale-Semantic Joint Decoupling Network for Image-Text Retrieval in Remote Sensing

Published: 24 August 2023

    Abstract

    Image-text retrieval in remote sensing aims to provide flexible information for data analysis and application. In recent years, state-of-the-art methods have been dedicated to “scale decoupling” and “semantic decoupling” strategies to further enhance representation capability. However, these approaches focus on disentangling either scale or semantics and do not merge the two ideas in a unified model, which severely limits the performance of cross-modal retrieval models. To address this issue, we propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval. Specifically, we design the Bidirectional Scale Decoupling (BSD) module, which uses Salience Extraction Map (SEM) and Salience Suppression Map (SSM) units to adaptively extract potential features and suppress cumbersome features at other scales in a bidirectional pattern, yielding clues at different scales. In addition, we design the Label-supervised Semantic Decoupling (LSD) module, which leverages category semantic labels as prior knowledge to supervise images and texts in probing significant semantic-related information. Finally, we design a Semantic-guided Triple Loss (STL), which adaptively generates a constant to adjust the loss function, improving the probability of matching images and texts that share the same semantics and shortening the convergence time of the retrieval model. Our proposed SSJDN outperforms state-of-the-art approaches in numerical experiments conducted on four benchmark remote sensing datasets.
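
    The abstract does not give the formula for the Semantic-guided Triple Loss, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a standard bidirectional hinge triplet loss over cosine similarities and a fixed, hypothetical constant `same_class_weight` that reduces the penalty for negatives sharing the anchor's category label. The function name, the parameters, and the weighting rule are all illustrative assumptions.

```python
import numpy as np


def semantic_guided_triplet_loss(img_emb, txt_emb, labels,
                                 margin=0.2, same_class_weight=0.5):
    """Bidirectional hinge triplet loss with a label-dependent weight.

    Illustrative only: the weighting rule is an assumption, not the
    formulation used in the SSJDN paper.
    """
    # L2-normalise so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T          # sim[i, j]: image i against text j
    pos = np.diag(sim)         # matched pairs lie on the diagonal

    # Hinge cost of every negative, in both retrieval directions.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    # Assumed rule: negatives from the anchor's category are penalised less.
    same = labels[:, None] == labels[None, :]
    weight = np.where(same, same_class_weight, 1.0)
    np.fill_diagonal(weight, 0.0)   # positive pairs contribute no cost

    return float(((cost_i2t + cost_t2i) * weight).sum() / len(labels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))    # 4 image embeddings, 8 dimensions
    texts = rng.normal(size=(4, 8))     # 4 paired text embeddings
    categories = np.array([0, 0, 1, 2]) # hypothetical scene-category labels
    print(semantic_guided_triplet_loss(images, texts, categories))
```

    In this sketch, a caption of another image from the same scene category is treated as a softer negative than a caption from a different category, which is one plausible reading of "improving the probability of matching images and texts that share the same semantics".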

    Cited By

    • (2024) Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62 (1-15). DOI: 10.1109/TGRS.2024.3395313. Online publication date: 2024
    • (2024) JM-CLIP: A Joint Modal Similarity Contrastive Learning Model for Video-Text Retrieval. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (3010-3014). DOI: 10.1109/ICASSP48485.2024.10446490. Online publication date: 14-Apr-2024
    • (2023) Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 61 (1-13). DOI: 10.1109/TGRS.2023.3332317. Online publication date: 2023

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 1
    January 2024
    639 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3613542
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2023
    Online AM: 07 June 2023
    Accepted: 22 May 2023
    Revised: 16 April 2023
    Received: 12 December 2022
    Published in TOMM Volume 20, Issue 1

    Author Tags

    1. Remote sensing
    2. scale-semantic joint decoupling
    3. image-text retrieval

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Natural Science Foundation of China
    • Fundamental Research Funds for the Central Universities
    • National Key Research and Development Program of China

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 370
    • Downloads (Last 6 weeks): 17
    Reflects downloads up to 27 Jul 2024
