
Knowledge Guided Evolutionary Transformer for Remote Sensing Scene Classification

Published: 01 October 2024

Abstract

Addressing the complex terrain and multi-scale targets in remote sensing (RS) images calls for a synergistic combination of Transformers and convolutional neural networks (CNNs). However, designing effective CNN architectures remains a major challenge. To address these difficulties, this study introduces the knowledge-guided evolutionary Transformer for RS scene classification (Evo RSFormer). It combines an adaptive evolutionary CNN (Evo CNN) with Transformers in a hybrid strategy that couples the fine-grained local feature extraction of CNNs with the long-range contextual dependency modeling of Transformers. To construct the Evo CNN blocks, the paper presents a knowledge-guided adaptive efficient multi-objective evolutionary neural architecture search (MOE2-NAS) strategy, which greatly reduces the labor-intensive nature of manual CNN design while balancing accuracy and compactness. By transferring domain knowledge from natural scene analysis to the RS field, MOE2-NAS also improves the efficiency of classical NAS: a priori knowledge is used to generate promising initial solutions, and a surrogate model is constructed for efficient search. The effectiveness of the proposed Evo RSFormer is rigorously evaluated on benchmark RS datasets, including UC Merced, NWPU45, and AID, and the empirical results demonstrate its superiority over existing methods. Further experiments on MOE2-NAS confirm the important role of knowledge guidance in improving NAS efficiency.
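To make the hybrid design concrete, below is a minimal PyTorch sketch of a CNN-Transformer block in the spirit of the abstract, not the authors' Evo RSFormer implementation: the plain convolutional stem stands in for the evolved Evo CNN blocks that MOE2-NAS would produce, and the layer counts, embedding width, and classification head are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's architecture): a CNN stage for
# fine-grained local features feeding a Transformer encoder for long-range
# contextual dependencies, followed by a simple classification head.
import torch
import torch.nn as nn


class HybridCNNTransformerBlock(nn.Module):
    def __init__(self, in_channels: int = 3, embed_dim: int = 96,
                 num_heads: int = 4, num_classes: int = 45):
        super().__init__()
        # CNN stage: local feature extraction with 4x spatial downsampling.
        # In Evo RSFormer this stage would be an evolved Evo CNN block.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        # Transformer stage: global self-attention over the flattened feature map.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(x)                        # (B, C, H, W) local features
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.transformer(tokens)          # long-range dependencies
        return self.head(tokens.mean(dim=1))       # global average pool + classify


if __name__ == "__main__":
    model = HybridCNNTransformerBlock(num_classes=45)  # e.g. NWPU45 has 45 classes
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 45])
```

In the method described by the abstract, the convolutional stage would instead be the architecture found by the knowledge-guided search, with classification accuracy and model compactness as the two search objectives.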



Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 10, Part 2
Oct. 2024
761 pages

Publisher

IEEE Press

Publication History

Published: 01 October 2024

Qualifiers

  • Research-article
