Abstract
Scene semantic segmentation plays an important role in computer vision. For real-time image segmentation, the waterfall atrous spatial pooling network (WASPnet) and the residual-two-network-segmentation (Res2Net-Seg) are two widely used lightweight networks. However, WASPnet often loses local features, and Res2Net-Seg tends to fuse trivial local features during feature extraction. To solve these problems, this paper incorporates the advantages of Res2Net-Seg into WASPnet and extends WASPnet in two respects. First, a buffer ladder, based on the atrous convolution structure and the spatial pyramid pooling architecture, is used to improve deep feature extraction by capturing multi-scale context. Second, the proposed architecture introduces a channel attention mechanism into the decoder, where it operates on the score maps output by the proposed structure. Compared with WASPnet, the proposed network increases the mean intersection over union (MIoU) on the Pascal Visual Object Classes (VOC) 2012 and Cityscapes datasets by 2.76% and 3.19%, respectively. The buffer ladder improves not only lightweight networks but also DeepLabv3+, which performs the best to date and has a module similar to that of WASPnet: it raises the MIoU of DeepLabv3+ on the Pascal VOC 2012 and Cityscapes datasets by 1.48% and 2.11%, respectively. Finally, this paper evaluates real-time performance on a GTX 2080Ti graphics processing unit, and the results show that the proposed networks are capable of fulfilling real-time segmentation tasks.
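The gains above are reported in MIoU. As a reminder of what that metric measures, the following is a minimal sketch of one common way to compute it from flattened label maps. The function names and toy data are illustrative assumptions, and the abstract does not state the paper's exact evaluation protocol (for example, how ignored pixels or absent classes are handled).

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows index the ground-truth class, columns the prediction."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int],
             num_classes: int) -> float:
    """Average per-class IoU = TP / (TP + FP + FN).

    Assumption: classes absent from both ground truth and prediction are
    skipped rather than counted as zero.
    """
    matrix = confusion_matrix(y_true, y_pred, num_classes)
    ious = []
    for c in range(num_classes):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in matrix) - tp   # predicted c, true class differs
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (hypothetical data).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
print(mean_iou(labels, preds, 3))
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. Real segmentation masks would be flattened to one label per pixel before being passed in.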
Data availability
Not applicable.
Funding
This work is supported by Hisense Group.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
About this article
Cite this article
Liu, Z., Lei, Z. Buffer ladder feature fusion architecture for semantic segmentation improvement. SIViP 18, 475–483 (2024). https://doi.org/10.1007/s11760-023-02754-1
DOI: https://doi.org/10.1007/s11760-023-02754-1