
JMFEEL-Net: a joint multi-scale feature enhancement and lightweight transformer network for crowd counting

Published: 30 January 2024

Abstract

Crowd counting based on convolutional neural networks (CNNs) has made significant progress in recent years. However, the limited receptive field of CNNs makes it difficult to capture the global features needed for comprehensive contextual modeling, resulting in insufficient accuracy in count estimation. In contrast, vision transformer (ViT)-based counting networks have demonstrated remarkable performance by exploiting their powerful global contextual modeling capabilities, but ViT models come with higher computational costs and are more difficult to train. In this paper, we propose a novel network named JMFEEL-Net, which uses joint multi-scale feature enhancement and a lightweight transformer to improve crowd counting accuracy. Specifically, we use a high-resolution CNN as the backbone to generate high-resolution feature maps. In the back-end network, we propose a multi-scale feature enhancement module to address the low recognition accuracy caused by multi-scale variations, especially when counting small-scale objects in dense scenes. Furthermore, we introduce an improved lightweight ViT encoder to effectively model complex global contexts. We also adopt a multi-density map supervision strategy that learns crowd distribution features from feature maps of different resolutions, thereby improving the quality and training efficiency of the density maps. To validate the effectiveness of the proposed method, we conduct extensive experiments on four challenging datasets, namely ShanghaiTech Part A/B, UCF-QNRF, and JHU-Crowd++, achieving very competitive counting performance.
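
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch-style sketch of how such a network could be wired together: a backbone producing feature maps, a multi-scale feature enhancement stage, a lightweight transformer encoder for global context, and density heads at two resolutions for multi-density-map supervision. All module names (MultiScaleFeatureEnhancement, LightweightViTEncoder, JMFEELNet), the stand-in backbone, and all layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; module names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFeatureEnhancement(nn.Module):
    """Fuse parallel dilated branches to enhance multi-scale (especially small-object) features."""
    def __init__(self, channels: int):
        super().__init__()
        # Parallel 3x3 branches with different dilation rates cover different receptive fields.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 3)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        feats = [F.relu(branch(x)) for branch in self.branches]
        return F.relu(self.fuse(torch.cat(feats, dim=1)))


class LightweightViTEncoder(nn.Module):
    """A small transformer encoder over flattened spatial tokens for global context modeling."""
    def __init__(self, channels: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class JMFEELNet(nn.Module):
    """Backbone -> multi-scale enhancement -> lightweight ViT -> multi-resolution density heads."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Stand-in backbone; the paper uses a high-resolution CNN (HRNet-style) instead.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.mfem = MultiScaleFeatureEnhancement(channels)
        self.vit = LightweightViTEncoder(channels)
        # Two density heads at different resolutions for multi-density-map supervision.
        self.head_low = nn.Conv2d(channels, 1, 1)
        self.head_high = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 1, 1))

    def forward(self, x):
        f = self.backbone(x)
        f = self.mfem(f)
        f = self.vit(f)
        return self.head_low(f), self.head_high(f)


if __name__ == "__main__":
    model = JMFEELNet()
    d_low, d_high = model(torch.randn(1, 3, 128, 128))
    # The predicted count is obtained by integrating (summing) a density map.
    print(d_low.shape, d_high.shape, d_low.sum().item())
```

Under this reading of the abstract, each predicted density map would be supervised against a ground-truth map rendered at the corresponding resolution, and the final count is the sum over a predicted map.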

Cited By

  • (2024) Caption matters: a new perspective for knowledge-based visual question answering. Knowledge and Information Systems 66(11): 6975–7003. https://doi.org/10.1007/s10115-024-02166-8. Online publication date: 1-Nov-2024

Published In

Knowledge and Information Systems, Volume 66, Issue 5
May 2024
451 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 30 January 2024
Accepted: 26 December 2023
Revision received: 29 October 2023
Received: 24 July 2023

Author Tags

  1. Crowd counting
  2. Count estimation
  3. Multi-scale variations
  4. Multi-density map supervision

Qualifiers

  • Research-article
