Abstract
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters, yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting.
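The abstract's triple-modal framing can be illustrated with a minimal NumPy sketch of the data flow. This is not the authors' network: `generate_broker`, `triple_modal_count`, and the toy `encode` function are hypothetical stand-ins, and the fixed per-pixel blend only mimics the role of the learned, lightweight (non-diffusion) broker generator.

```python
import numpy as np

def generate_broker(rgb, thermal, alpha=0.5):
    # Fixed per-pixel blend as a stand-in for the paper's learned,
    # lightweight (non-diffusion) broker-modality generator.
    return alpha * rgb + (1.0 - alpha) * thermal

def triple_modal_count(rgb, thermal, encode):
    # Treat RGB, broker, and thermal as three input streams and fuse
    # their density estimates; the real model fuses deep features.
    broker = generate_broker(rgb, thermal)
    densities = [encode(x) for x in (rgb, broker, thermal)]
    fused = np.mean(densities, axis=0)
    return fused.sum()  # estimated head count

# Toy "encoder": normalized intensity as a stand-in density predictor.
rgb = np.random.default_rng(0).random((32, 32))
thermal = np.random.default_rng(1).random((32, 32))
count = triple_modal_count(rgb, thermal, encode=lambda x: x / x.sum())
```

Since each toy density map is normalized to sum to one, the fused count here is trivially one; in the actual method the three streams carry learned features and the head is trained with counting losses.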
Notes
- 1.
For the infrared-visible fusion task, the DDFM model retains both structural and detail information from the source images and meets visual-fidelity requirements. We therefore use its fusion results to pre-train our broker-modality generator.
- 2.
To better illustrate the challenges posed by this problem, we conducted experiments assessing the effectiveness of image alignment-based de-ghosting algorithms such as those of [4, 9, 30]. Our results indicate that while satisfactory outcomes can be attained on natural image pairs, these algorithms perform poorly on image pairs with low imaging quality, such as the data used in this study. Further comprehensive experiments are available in the supplementary material.
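The ghosting effect described in Note 2 can be reproduced with a toy 1-D example (a sketch for illustration only, not from the paper): naively fusing a pair of misaligned modalities duplicates every object.

```python
import numpy as np

# A single bright "head" at column 10 in the visible image; the thermal
# view of the same scene is shifted 3 pixels by imperfect registration.
visible = np.zeros(32)
visible[10] = 1.0
thermal = np.roll(visible, 3)  # cross-sensor misalignment

fused = 0.5 * (visible + thermal)  # naive per-pixel fusion

# The single person now responds twice in the fused signal: a "ghost".
peaks = np.flatnonzero(fused > 0.25)  # columns 10 and 13
```

A density-based counter reading `fused` would see two heads where there is one, which is why direct cross-modal image fusion needs the de-ghosting treatment discussed above.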
References
Alehdaghi, M., Josi, A., Shamsolmoali, P., Cruz, R.M., Granger, E.: Adaptive generation of privileged intermediate information for visible-infrared person re-identification. arXiv preprint arXiv:2307.03240 (2023)
Chen, K., Chen, J.K., Chuang, J., Vázquez, M., Savarese, S.: Topological planning with transformers for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11276–11286 (2021)
Fan, D.-P., Zhai, Y., Borji, A., Yang, J., Shao, L.: BBS-Net: RGB-D salient object detection with a bifurcated backbone strategy network. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 275–292. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_17
Gao, J., Cai, X.F.: Image matching method based on multi-scale corner detection. In: 2017 13th International Conference on Computational Intelligence and Security (CIS), pp. 125–129. IEEE (2017)
Guerrero-Gómez-Olmedo, R., Torre-Jiménez, B., López-Sastre, R., Maldonado-Bascón, S., Oñoro-Rubio, D.: Extremely overlapping vehicle counting. In: Paredes, R., Cardoso, J.S., Pardo, X.M. (eds.) IbPRIA 2015. LNCS, vol. 9117, pp. 423–431. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19390-8_48
Guo, Q., Yuan, P., Huang, X., Ye, Y.: Consistency-constrained RGB-T crowd counting via mutual information maximization. Complex Intell. Syst. 1–22 (2024)
Huang, Z., Liu, J., Fan, X., Liu, R., Zhong, W., Luo, Z.: ReCoNet: recurrent correction network for fast and efficient multi-modality image fusion. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. LNCS, vol. 13678, pp. 539–555. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19797-0_31
Idrees, H., et al.: Composition loss for counting, density map estimation and localization in dense crowds. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–546 (2018)
Jiang, Q., et al.: A contour angle orientation for power equipment infrared and visible image registration. IEEE Trans. Power Deliv. 36(4), 2559–2569 (2020)
Kong, W., Liu, J., Hong, Y., Li, H., Shen, J.: Cross-modal collaborative feature representation via transformer-based multimodal mixers for RGB-T crowd counting. Expert Syst. Appl. 124483 (2024)
Li, D., Wei, X., Hong, X., Gong, Y.: Infrared-visible cross-modal person re-identification with an x modality. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 4610–4617 (2020)
Li, H., Zhang, S., Kong, W.: Learning the cross-modal discriminative feature representation for RGB-T crowd counting. Knowl.-Based Syst. 257, 109944 (2022)
Li, H., Zhang, S., Kong, W.: RGB-D crowd counting with cross-modal cycle-attention fusion and fine-coarse supervision. IEEE Trans. Ind. Inf. 19(1), 306–316 (2022)
Li, Y., Zhang, X., Chen, D.: CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100 (2018)
Li, Y., Wang, H., Luo, Y.: A comparison of pre-trained vision-and-language models for multimodal representation learning across medical images and reports. In: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1999–2004. IEEE (2020)
Lian, D., Chen, X., Li, J., Luo, W., Gao, S.: Locating and counting heads in crowds with a depth prior. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 9056–9072 (2021)
Lian, D., Li, J., Zheng, J., Luo, W., Gao, S.: Density map regression guided detection network for RGB-D crowd counting and localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1821–1830 (2019)
Lin, H., et al.: Direct measure matching for crowd counting. In: The Thirtieth International Joint Conference on Artificial Intelligence (2021)
Lin, H., Ma, Z., Hong, X., Shangguan, Q., Meng, D.: Gramformer: learning crowd counting via graph-modulated transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 3395–3403 (2024)
Lin, H., Ma, Z., Hong, X., Wang, Y., Su, Z.: Semi-supervised crowd counting via density agency. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 1416–1426 (2022)
Lin, H., Ma, Z., Ji, R., Wang, Y., Hong, X.: Boosting crowd counting via multifaceted attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19628–19637 (2022)
Liu, C., Lu, H., Cao, Z., Liu, T.: Point-query quadtree for crowd counting, localization, and more. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1676–1685 (2023)
Liu, J., Gao, C., Meng, D., Hauptmann, A.G.: DecideNet: counting varying density crowds through attention guided detection and density estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206 (2018)
Liu, L., Chen, J., Wu, H., Li, G., Li, C., Lin, L.: Cross-modal collaborative representation learning and a large-scale RGB-T benchmark for crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4823–4833 (2021)
Liu, L., Qiu, Z., Li, G., Liu, S., Ouyang, W., Lin, L.: Crowd counting with deep structured scale integration network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
Liu, L., Wang, H., Li, G., Ouyang, W., Lin, L.: Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601 (2018)
Liu, Y., Liu, L., Wang, P., Zhang, P., Lei, Y.: Semi-supervised crowd counting via self-training on surrogate tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 242–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_15
Liu, Y., Cao, G., Shi, B., Hu, Y.: CCANet: a collaborative cross-modal attention network for RGB-D crowd counting. IEEE Trans. Multimed. (2023)
Liu, Z., Wu, W., Tan, Y., Zhang, G.: RGB-T multi-modal crowd counting based on transformer. In: The 33rd British Machine Vision Conference 2022 (2022)
Ma, J., Zhou, H., Zhao, J., Gao, Y., Jiang, J., Tian, J.: Robust feature matching for remote sensing image registration via locally linear transforming. IEEE Trans. Geosci. Remote Sens. 53(12), 6469–6481 (2015)
Ma, Z., Wei, X., Hong, X., Gong, Y.: Bayesian loss for crowd count estimation with point supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6142–6151 (2019)
Ma, Z., Wei, X., Hong, X., Gong, Y.: Learning scales from points: a scale-aware probabilistic model for crowd counting. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 220–228 (2020)
Ma, Z., Wei, X., Hong, X., Lin, H., Qiu, Y., Gong, Y.: Learning to count via unbalanced optimal transport. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2319–2327 (2021)
Mo, H., et al.: Attention-guided collaborative counting. IEEE Trans. Image Process. 31, 6306–6319 (2022)
Mu, B., Shao, F., Xie, Z., Chen, H., Jiang, Q., Ho, Y.S.: Visual prompt multi-branch fusion network for RGB-thermal crowd counting. IEEE Internet Things J. (2024)
Pan, Y., Zhou, W., Fang, M., Qiang, F.: Graph enhancement and transformer aggregation network for RGB-thermal crowd counting. IEEE Geosci. Remote Sens. Lett. (2024)
Pan, Y., Zhou, W., Qian, X., Mao, S., Yang, R., Yu, L.: CGINet: cross-modality grade interaction network for RGB-T crowd counting. Eng. Appl. Artif. Intell. 126, 106885 (2023)
Pang, Y., Zhang, L., Zhao, X., Lu, H.: Hierarchical dynamic filtering network for RGB-D salient object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 235–252. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_15
Peng, T., Li, Q., Zhu, P.: RGB-T crowd counting from drone: a benchmark and MMCCN network. In: Proceedings of the Asian Conference on Computer Vision (2020)
Ren, S., Du, Y., Lv, J., Han, G., He, S.: Learning from the master: distilling cross-modal advanced knowledge for lip reading. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13325–13333 (2021)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Sam, D.B., Peri, S.V., Sundararaman, M.N., Kamath, A., Babu, R.V.: Locate, size, and count: accurately resolving people in dense crowds via detection. IEEE Trans. Pattern Anal. Mach. Intell. 43(8), 2739–2751 (2020)
Sindagi, V.A., Patel, V.M.: Multi-level bottom-top and top-bottom feature fusion for crowd counting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1002–1012 (2019)
Tang, H., Wang, Y., Chau, L.P.: TAFNet: a three-stream adaptive fusion network for RGB-T crowd counting. In: 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 3299–3303. IEEE (2022)
Wang, Y., Hou, J., Hou, X., Chau, L.P.: A self-training approach for point-supervised object detection and counting in crowds. IEEE Trans. Image Process. 30, 2876–2887 (2021)
Wang, Z., Wang, Z., Zheng, Y., Chuang, Y.Y., Satoh, S.: Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 618–626 (2019)
Wei, X., Li, D., Hong, X., Ke, W., Gong, Y.: Co-attentive lifting for infrared-visible person re-identification. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1028–1037 (2020)
Wu, Z., Liu, L., Zhang, Y., Mao, M., Lin, L., Li, G.: Multimodal crowd counting with mutual attention transformers. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)
Xie, Z., et al.: Cross-modality double bidirectional interaction and fusion network for RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 33(8), 4149–4163 (2023)
Xie, Z., et al.: BGDFNet: bidirectional gated and dynamic fusion network for RGB-T crowd counting in smart city system. IEEE Trans. Instrum. Meas. (2024)
Xu, H., Yuan, J., Ma, J.: MURF: mutually reinforcing multi-modal image registration and fusion. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
Yang, X., Zhou, W., Yan, W., Qian, X.: CAGNet: coordinated attention guidance network for RGB-T crowd counting. Expert Syst. Appl. 243, 122753 (2024)
Yang, Y., Li, G., Wu, Z., Su, L., Huang, Q., Sebe, N.: Reverse perspective network for perspective-aware object counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4374–4383 (2020)
Yu, L., et al.: CommerceMM: large-scale commerce multimodal representation learning with omni retrieval. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4433–4442 (2022)
Zhang, B., Du, Y., Zhao, Y., Wan, J., Tong, Z.: I-MMCCN: improved MMCCN for RGB-T crowd counting of drone images. In: 2021 7th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC), pp. 117–121. IEEE (2021)
Zhang, J., et al.: UC-Net: uncertainty inspired RGB-D saliency detection via conditional variational autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8582–8591 (2020)
Zhang, Q., Chan, A.B.: Wide-area crowd counting via ground-plane density maps and multi-view fusion CNNs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8297–8306 (2019)
Zhang, Y., Choi, S., Hong, S.: Spatio-channel attention blocks for cross-modal crowd counting. In: Proceedings of the Asian Conference on Computer Vision, pp. 90–107 (2022)
Zhang, Y., Yan, Y., Lu, Y., Wang, H.: Towards a unified middle modality learning for visible-infrared person re-identification. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 788–796 (2021)
Zhao, W., Xie, S., Zhao, F., He, Y., Lu, H.: MetaFusion: infrared and visible image fusion via meta-feature embedding from object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13955–13965 (2023)
Zhao, Z., et al.: CDDFuse: correlation-driven dual-branch feature decomposition for multi-modality image fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5906–5916 (2023)
Zhao, Z., et al.: DDFM: denoising diffusion model for multi-modality image fusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8082–8093 (2023)
Zhou, W., Pan, Y., Lei, J., Ye, L., Yu, L.: DEFNet: dual-branch enhanced feature fusion network for RGB-T crowd counting. IEEE Trans. Intell. Transp. Syst. 23(12), 24540–24549 (2022)
Zhou, W., Yang, X., Dong, X., Fang, M., Yan, W., Luo, T.: MJPNet-S*: multistyle joint-perception network with knowledge distillation for drone RGB-thermal crowd density estimation in smart cities. IEEE Internet Things J. (2024)
Zhou, W., Yang, X., Lei, J., Yan, W., Yu, L.: MC3Net: multimodality cross-guided compensation coordination network for RGB-T crowd counting. IEEE Trans. Intell. Transp. Syst. (2023)
Acknowledgements
This work was funded in part by the National Natural Science Foundation of China (62076195, 62376070) and in part by the Fundamental Research Funds for the Central Universities (AUGA5710011522).
Electronic supplementary material
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Meng, H., Hong, X., Wang, C., Shang, M., Zuo, W. (2025). Multi-modal Crowd Counting via a Broker Modality. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_14
Print ISBN: 978-3-031-72903-4
Online ISBN: 978-3-031-72904-1