Abstract
In image segmentation tasks, contextual information is crucial because it provides essential semantic details. Multi-scale feature extraction methods help models capture this contextual information comprehensively, but they can introduce redundant features and leave the receptive field insufficient in some regions, particularly for large objects or complex scenes. To address these issues, we propose the Adaptive Feature Perception Module (AFPM). Inspired by the visual system, we combine the pyramid model with dilated convolutions and incorporate a spatial shift mechanism for extensive information capture. This module adaptively adjusts its focus and perception range to maximize target feature capture. In addition, we introduce the Channel and Spectral Attention Module (CSAM) to model dependencies between channels and spectral domains, enabling the network to learn more discriminative features and improve segmentation accuracy. Based on these enhancements, we propose a new network model called AMFFNet. We validated its effectiveness by comparing it with several state-of-the-art methods on the PASCAL VOC 2012, Cityscapes, and ADE20K datasets. The results demonstrate that AMFFNet achieves superior performance.
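The abstract does not spell out how AFPM is built, so the following is only a generic sketch of the two ingredients it names, not the authors' implementation. It shows how a dilation rate widens the span of a convolution kernel (using the standard relation k + (k - 1)(d - 1)) and what a one-step spatial shift does to a small feature map. The dilation rates, the function names, and the zero padding are all assumptions chosen for illustration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel along one axis: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def shift_right(feature_map, fill=0):
    """Shift every row of a 2D map one step right, padding the vacated column."""
    return [[fill] + row[:-1] for row in feature_map]


def shift_down(feature_map, fill=0):
    """Shift a 2D map one step down, padding the vacated top row."""
    width = len(feature_map[0])
    return [[fill] * width] + [row[:] for row in feature_map[:-1]]


# Hypothetical dilation rates for three parallel 3x3 branches of a pyramid.
rates = [1, 2, 4]
spans = [effective_kernel_size(3, d) for d in rates]

# A toy 3x3 feature map; real feature maps would be multi-channel tensors.
grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
right = shift_right(grid)
down = shift_down(grid)

print(spans)
print(right)
print(down)
```

With the same 3x3 kernel, larger dilation rates cover a wider span at no extra parameter cost, which is why parallel dilated branches are a common way to gather multi-scale context. Shifting copies of a feature map in different directions lets a later pointwise operation mix information from neighbouring positions.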
Data availability
The data that support the findings of this study are openly available in the PASCAL VOC 2012, ADE20K, and Cityscapes databases.
Acknowledgements
This work is supported by the Fundamental Research Funds for the Central Universities (grant no. 3072022CF0801). We are grateful to the editor and the anonymous reviewers for their helpful suggestions to improve the quality of the paper.
Author information
Contributions
Haoyu Wang and Hongru Wang contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, H., Wang, H. Adaptive multi-scale feature fusion with spatial translation for semantic segmentation. SIViP 18, 8337–8348 (2024). https://doi.org/10.1007/s11760-024-03477-7
DOI: https://doi.org/10.1007/s11760-024-03477-7