Scale-Semantic Joint Decoupling Network for Image-Text Retrieval in Remote Sensing

Authors: Chengyu Zheng, Ning Song, Ruoyu Zhang, Lei Huang, Zhiqiang Wei, Jie Nie

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 1

Article No.: 4, Pages 1 - 20

https://doi.org/10.1145/3603628

Published: 24 August 2023

Abstract

Image-text retrieval in remote sensing aims to provide flexible information for data analysis and applications. In recent years, state-of-the-art methods have adopted “scale decoupling” and “semantic decoupling” strategies to further enhance representation capability. However, these approaches disentangle either scale or semantics and do not combine the two ideas in a unified model, which severely limits the performance of cross-modal retrieval models. To address this issue, we propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval. Specifically, we design the Bidirectional Scale Decoupling (BSD) module, which uses Salience Extraction Map (SEM) and Salience Suppression Map (SSM) units to adaptively extract potential features and suppress cumbersome features at other scales in a bidirectional pattern, yielding clues at different scales. In addition, we design the Label-supervised Semantic Decoupling (LSD) module, which leverages category semantic labels as prior knowledge to supervise images and texts in probing significant semantic-related information. Finally, we design a Semantic-guided Triple Loss (STL), which adaptively generates a constant to adjust the loss function, improving the probability of matching images and texts with the same semantics and shortening the convergence time of the retrieval model. In numerical experiments on four benchmark remote sensing datasets, the proposed SSJDN outperforms state-of-the-art approaches.
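
As a rough illustration of the loss idea described in the abstract, the sketch below shows a bidirectional hinge triplet loss in which a label-dependent constant adjusts the margin. The abstract does not give the STL formula, so the function name `semantic_guided_triplet_loss`, the parameters `margin` and `same_class_scale`, and the rule of shrinking the margin for negatives that share the anchor's category are all assumptions made for illustration, not the authors' definition.

```python
def semantic_guided_triplet_loss(sim, labels, margin=0.2, same_class_scale=0.5):
    """Bidirectional hinge triplet loss over a square similarity matrix.

    sim[i][j] is the similarity between image i and text j; matched pairs
    sit on the diagonal. labels[i] is the category label of pair i.

    ASSUMPTION (not from the paper): a negative that shares the anchor's
    category uses a smaller margin, margin * same_class_scale, so that
    same-semantic images and texts are pushed apart less strongly.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]  # similarity of the matched image-text pair
        for j in range(n):
            if i == j:
                continue
            same = labels[i] == labels[j]
            m = margin * same_class_scale if same else margin
            # Image i as anchor, text j as negative.
            total += max(0.0, m + sim[i][j] - pos)
            # Text i as anchor, image j as negative.
            total += max(0.0, m + sim[j][i] - pos)
    return total


# Toy batch: pairs 0 and 1 share a category, pair 2 does not.
sim = [
    [0.9, 0.8, 0.1],
    [0.7, 0.9, 0.2],
    [0.1, 0.3, 0.9],
]
labels = ["airport", "airport", "beach"]

guided = semantic_guided_triplet_loss(sim, labels)
plain = semantic_guided_triplet_loss(sim, labels, same_class_scale=1.0)
print(guided < plain)
```

With `same_class_scale=1.0` the function reduces to an ordinary fixed-margin triplet loss, which makes it easy to compare the two settings on the same batch. In the toy batch, the same-category negatives are penalized less under the adjusted margin, so the guided loss is smaller than the plain one.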

Cited By

Wu D, Li H, Hou Y, Xu C, Cheng G, Guo L, Liu H. (2024). Spatial–Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62, 1–15. Online publication date: 2024. https://doi.org/10.1109/TGRS.2024.3395313
Ge M, Li Y, Wu H, Li M. (2024). JM-CLIP: A Joint Modal Similarity Contrastive Learning Model for Video-Text Retrieval. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3010–3014. Online publication date: 14-Apr-2024. https://doi.org/10.1109/ICASSP48485.2024.10446490
Ji Z, Meng C, Zhang Y, Pang Y, Li X. (2023). Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 61, 1–13. Online publication date: 2023. https://doi.org/10.1109/TGRS.2023.3332317

Index Terms

Scale-Semantic Joint Decoupling Network for Image-Text Retrieval in Remote Sensing
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 1

January 2024

639 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3613542

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2023

Online AM: 07 June 2023

Accepted: 22 May 2023

Revised: 16 April 2023

Received: 12 December 2022

Published in TOMM Volume 20, Issue 1

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Fundamental Research Funds for the Central Universities
National Key Research and Development Program of China
