
JMFEEL-Net: a joint multi-scale feature enhancement and lightweight transformer network for crowd counting

Published: 30 January 2024

Abstract

Crowd counting based on convolutional neural networks (CNNs) has made significant progress in recent years. However, the limited receptive field of CNNs makes it difficult to capture the global features needed for comprehensive contextual modeling, resulting in insufficient accuracy in count estimation. In contrast, vision transformer (ViT)-based counting networks have demonstrated remarkable performance by exploiting their powerful global contextual modeling capabilities, but ViT models come with higher computational costs and are more difficult to train. In this paper, we propose a novel network named JMFEEL-Net, which uses joint multi-scale feature enhancement and a lightweight transformer to improve crowd counting accuracy. Specifically, we use a high-resolution CNN as the backbone to generate high-resolution feature maps. In the back-end network, we propose a multi-scale feature enhancement module to address the low recognition accuracy caused by multi-scale variations, especially when counting small-scale objects in dense scenes. Furthermore, we introduce an improved lightweight ViT encoder to effectively model complex global contexts. We also adopt a multi-density map supervision strategy that learns crowd distribution features from feature maps of different resolutions, thereby improving the quality and training efficiency of the density maps. To validate the effectiveness of the proposed method, we conduct extensive experiments on four challenging datasets, namely ShanghaiTech Part A/B, UCF-QNRF, and JHU-Crowd++, achieving very competitive counting performance.
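
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch-style sketch of how such a network could be wired together: a backbone producing feature maps, a multi-scale feature enhancement stage, a lightweight transformer encoder for global context, and density heads at two resolutions for multi-density-map supervision. All module names (MultiScaleFeatureEnhancement, LightweightViTEncoder, JMFEELNet), the stand-in backbone, and all layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; module names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFeatureEnhancement(nn.Module):
    """Fuse parallel dilated branches to enhance multi-scale (especially small-object) features."""
    def __init__(self, channels: int):
        super().__init__()
        # Parallel 3x3 branches with different dilation rates cover different receptive fields.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 3)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        feats = [F.relu(branch(x)) for branch in self.branches]
        return F.relu(self.fuse(torch.cat(feats, dim=1)))


class LightweightViTEncoder(nn.Module):
    """A small transformer encoder over flattened spatial tokens for global context modeling."""
    def __init__(self, channels: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class JMFEELNet(nn.Module):
    """Backbone -> multi-scale enhancement -> lightweight ViT -> multi-resolution density heads."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Stand-in backbone; the paper uses a high-resolution CNN (HRNet-style) instead.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.mfem = MultiScaleFeatureEnhancement(channels)
        self.vit = LightweightViTEncoder(channels)
        # Two density heads at different resolutions for multi-density-map supervision.
        self.head_low = nn.Conv2d(channels, 1, 1)
        self.head_high = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 1, 1))

    def forward(self, x):
        f = self.backbone(x)
        f = self.mfem(f)
        f = self.vit(f)
        return self.head_low(f), self.head_high(f)


if __name__ == "__main__":
    model = JMFEELNet()
    d_low, d_high = model(torch.randn(1, 3, 128, 128))
    # The predicted count is obtained by integrating (summing) a density map.
    print(d_low.shape, d_high.shape, d_low.sum().item())
```

Under this reading of the abstract, each predicted density map would be supervised against a ground-truth map rendered at the corresponding resolution, and the final count is the sum over a predicted map.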

Cited By

  • (2024) Caption matters: a new perspective for knowledge-based visual question answering. Knowledge and Information Systems 66(11): 6975–7003. https://doi.org/10.1007/s10115-024-02166-8. Online publication date: 1-Nov-2024

Published In

Knowledge and Information Systems, Volume 66, Issue 5
May 2024
451 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 30 January 2024
Accepted: 26 December 2023
Revision received: 29 October 2023
Received: 24 July 2023

Author Tags

  1. Crowd counting
  2. Count estimation
  3. Multi-scale variations
  4. Multi-density map supervision

Qualifiers

  • Research-article
