Abstract
Crowd counting is the task of estimating the total number of pedestrians in an image. Most existing research addresses scenes with good visibility, such as parks, squares, and brightly lit shopping malls during the day; complex scenes in darkness have received little attention. To study this problem, we propose a Transformer-based interactive network for multimodal crowd counting. First, sliding convolutional encoding is applied to the input image to obtain better encoded features. These features are extracted by the designed primary interaction network and then modulated by channel token attention. Next, the FGAF-MLP fuses high- and low-level semantics to strengthen the feature representation and to fully fuse the data from the different modalities, improving the accuracy of the method. To verify its effectiveness, we conducted extensive ablation experiments on the recent multimodal benchmark RGBT-CC, confirming both the complementarity of the modalities and the contribution of each model component. We also validated our method on the ShanghaiTechRGBD benchmark. The experimental results show that the proposed method performs well, improving the mean absolute error and mean squared error on the RGBT-CC benchmark by more than 10%.
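To make the channel token attention step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module name ChannelTokenAttention, the squeeze-and-excitation-style gating, and the token shapes are our assumptions for illustration, and the paper's actual design may differ.

```python
import torch
import torch.nn as nn

class ChannelTokenAttention(nn.Module):
    """Hypothetical sketch: tokens from one modality are pooled into a
    channel descriptor that gates the channels of the other modality."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x, context: (batch, tokens, dim). Pool the context tokens,
        # compute per-channel weights, and modulate x channel-wise.
        w = self.gate(context.mean(dim=1))  # (batch, dim)
        return x * w.unsqueeze(1)           # broadcast over the token axis

# Cross-modal usage: RGB tokens modulated by thermal tokens and vice versa.
rgb, thermal = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
attn = ChannelTokenAttention(dim=256)
rgb_out, thermal_out = attn(rgb, thermal), attn(thermal, rgb)
print(rgb_out.shape, thermal_out.shape)  # both (2, 196, 256)
```

In this sketch the two directions share one gating network; whether the paper shares or separates these parameters is a design choice we leave open.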
Acknowledgements
This paper was supported by the National Natural Science Foundation of China (No. 62163016, 62066014, 62202165), the Natural Science Foundation of Jiangxi Province (20212ACB202001, 20202BABL202018), and the Double Thousand Plan of Jiangxi Province of China.
Ethics declarations
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "An interactive network based on transformer for multimodal crowd counting".
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, Y., Cai, Z., Miao, D. et al. An interactive network based on transformer for multimodal crowd counting. Appl Intell 53, 22602–22614 (2023). https://doi.org/10.1007/s10489-023-04721-2