Abstract
Crowd counting is the task of estimating the total number of pedestrians in an image. Most existing research addresses scenes with good visibility, such as parks, squares, and brightly lit shopping malls during the day; complex scenes in darkness have received little attention. To study this problem, we propose a Transformer-based interactive network for multimodal crowd counting. First, sliding convolutional encoding is applied to the input image to obtain better encoded features. These features are extracted by the designed primary interaction network and then modulated by channel token attention. Next, the FGAF-MLP fuses high- and low-level semantics to strengthen the feature representation and to fully fuse the data from the different modalities, improving the accuracy of the method. To verify its effectiveness, we conducted extensive ablation experiments on the recent multimodal benchmark RGBT-CC, confirming both the complementarity of the modalities and the contribution of each model component. We also validated our method on the ShanghaiTechRGBD benchmark. The experimental results show that the proposed method performs well, improving the mean absolute error and mean squared error on the RGBT-CC benchmark by more than 10%.
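To make the channel token attention step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module name ChannelTokenAttention, the squeeze-and-excitation-style gating, and the token shapes are our assumptions for illustration, and the paper's actual design may differ.

```python
import torch
import torch.nn as nn

class ChannelTokenAttention(nn.Module):
    """Hypothetical sketch: tokens from one modality are pooled into a
    channel descriptor that gates the channels of the other modality."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x, context: (batch, tokens, dim). Pool the context tokens,
        # compute per-channel weights, and modulate x channel-wise.
        w = self.gate(context.mean(dim=1))  # (batch, dim)
        return x * w.unsqueeze(1)           # broadcast over the token axis

# Cross-modal usage: RGB tokens modulated by thermal tokens and vice versa.
rgb, thermal = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
attn = ChannelTokenAttention(dim=256)
rgb_out, thermal_out = attn(rgb, thermal), attn(thermal, rgb)
print(rgb_out.shape, thermal_out.shape)  # both (2, 196, 256)
```

In this sketch the two directions share one gating network; whether the paper shares or separates these parameters is a design choice we leave open.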
Acknowledgements
This paper was supported by the National Natural Science Foundation of China (No. 62163016, 62066014, 62202165), the Natural Science Foundation of Jiangxi Province (20212ACB202001, 20202BABL202018), and the Double Thousand Plan of Jiangxi Province of China.
Ethics declarations
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "An interactive network based on transformer for multimodal crowd counting".
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, Y., Cai, Z., Miao, D. et al. An interactive network based on transformer for multimodal crowd counting. Appl Intell 53, 22602–22614 (2023). https://doi.org/10.1007/s10489-023-04721-2