
Adaptive multi-scale feature fusion with spatial translation for semantic segmentation

  • Original Paper
  • Journal: Signal, Image and Video Processing

Abstract

In image segmentation tasks, contextual information is crucial because it provides essential semantic details. Multi-scale feature extraction methods help models capture this contextual information comprehensively, but they can introduce redundancy and leave the receptive field insufficient in some areas, particularly for large objects or complex scenes. To address these issues, we propose the Adaptive Feature Perception Module (AFPM). Inspired by the visual system, we combine the pyramid model with dilated convolutions and incorporate a spatial shift mechanism for extensive information capture. This module adaptively adjusts its focus and perception range to maximize target feature capture. In addition, we introduce the Channel and Spectral Attention Module (CSAM) to model dependencies between channels and spectral domains, enabling the network to learn more discriminative features and improve segmentation accuracy. Based on these enhancements, we propose a new network model called AMFFNet. We validated its effectiveness by comparing it with several state-of-the-art methods on the PASCAL VOC 2012, Cityscapes, and ADE20K datasets. The results demonstrate that AMFFNet offers superior performance on these benchmarks.
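The two ingredients the abstract names for AFPM, dilated convolutions and a spatial shift, can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not describe. The function names, the dilation rates (1, 6, 12, 18), and the zero-fill shift are assumptions chosen for illustration. The first function uses the standard formula for the span of a dilated kernel, k + (k - 1)(d - 1). The second shows one simple way to translate a feature map by a fixed number of columns.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input pixels, covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def shift_right(feature_map, step=1, fill=0.0):
    """Shift a 2-D feature map (a list of rows) `step` columns to the right.

    Vacated positions on the left are filled with `fill`; values pushed past
    the right edge are dropped. This is an illustrative shift, not the
    paper's spatial translation mechanism.
    """
    width = len(feature_map[0])
    step = min(step, width)
    return [[fill] * step + row[: width - step] for row in feature_map]


if __name__ == "__main__":
    # Hypothetical dilation rates for a 3x3 kernel (assumed, not from the paper).
    for rate in (1, 6, 12, 18):
        print(f"dilation={rate}: span={effective_kernel_size(3, rate)}")
    print(shift_right([[1, 2, 3], [4, 5, 6]]))
```

With a 3x3 kernel, larger dilation rates widen the span without adding weights. This is why pyramids of dilated convolutions are a common way to enlarge the receptive field for large objects.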

Data availability

The data that support the findings of this study are openly available in the PASCAL VOC 2012, ADE20K, and Cityscapes datasets.

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 3072022CF0801). We are grateful to the editor and the anonymous reviewers for their helpful suggestions, which improved the quality of the paper.

Author information

Contributions

Haoyu Wang and Hongru Wang contributed equally to this work.

Corresponding author

Correspondence to Haoyu Wang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, H., Wang, H. Adaptive multi-scale feature fusion with spatial translation for semantic segmentation. SIViP 18, 8337–8348 (2024). https://doi.org/10.1007/s11760-024-03477-7

  • DOI: https://doi.org/10.1007/s11760-024-03477-7
