DOI: 10.1145/3664647.3681560
research-article

Unsupervised Multi-view Pedestrian Detection

Published: 28 October 2024

Abstract

As intelligent surveillance has spread, multiple cameras are increasingly used to localize pedestrians more accurately. However, previous methods rely on laborious annotation of pedestrians in every frame and camera view. We therefore propose an Unsupervised Multi-view Pedestrian Detection approach (UMPD) that learns an annotation-free detector via vision-language models and 2D-3D cross-modal mapping. 1) Semantic-aware Iterative Segmentation (SIS) extracts unsupervised representations of multi-view images and converts them into 2D masks that serve as pseudo labels, using our proposed iterative PCA and zero-shot semantic classes from vision-language models. 2) A Geometry-aware Volume-based Detector (GVD) encodes multi-view 2D images end-to-end into a 3D volume via 2D-to-3D geometric projection to predict voxel-wise density and color, and is trained with 3D-to-2D rendering losses against the SIS pseudo labels. 3) To improve the detection results, i.e., the 3D density projected onto the Bird's-Eye-View (BEV), Vertical-aware BEV Regularization (VBR) constrains pedestrians to be vertical, as in their natural poses. Extensive experiments on the popular multi-view pedestrian detection benchmarks Wildtrack, Terrace, and MultiviewX show that UMPD, to the best of our knowledge the first fully unsupervised method, performs competitively with previous state-of-the-art supervised methods. Code is available at https://github.com/lmy98129/UMPD.
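
To make the geometric terms in the abstract concrete, the following is a minimal NumPy sketch of two generic building blocks: projecting 3D voxel centers into a calibrated camera view with a pinhole model, and collapsing a voxel-wise density volume onto a BEV map. It is not the authors' implementation. The function names, the choice of z as the vertical axis, the voxel grid layout, and the use of a max over the vertical axis are all assumptions made for illustration; the actual GVD and VBR formulations are given in the paper and the linked repository.

```python
import numpy as np

def voxel_centers(nx, ny, nz, voxel_size, origin):
    """World coordinates of voxel centers, shape (nx*ny*nz, 3).

    Assumption: an axis-aligned grid whose third axis (z) is vertical.
    """
    ix, iy, iz = np.meshgrid(
        np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij"
    )
    idx = np.stack([ix, iy, iz], axis=-1).reshape(-1, 3).astype(np.float64)
    return np.asarray(origin, dtype=np.float64) + (idx + 0.5) * voxel_size

def project_points(points_world, K, R, t):
    """Standard pinhole projection x ~ K (R X + t).

    Returns pixel coordinates (N, 2) and camera-frame depth (N,).
    Points with depth <= 0 lie behind the camera and should be masked out
    by the caller.
    """
    cam = points_world @ R.T + t          # world frame -> camera frame
    depth = cam[:, 2]
    uvw = cam @ K.T                       # camera frame -> homogeneous pixels
    with np.errstate(divide="ignore", invalid="ignore"):
        uv = uvw[:, :2] / uvw[:, 2:3]     # perspective divide
    return uv, depth

def density_to_bev(density):
    """Collapse an (nx, ny, nz) density volume to an (nx, ny) BEV map.

    Assumption: a max over the vertical axis; other reductions are possible.
    """
    return density.max(axis=2)

# Toy camera (hypothetical values): focal length 100 px, principal point (50, 50),
# identity rotation, placed so the world origin is 5 units in front of it.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])

centers = voxel_centers(2, 2, 2, voxel_size=1.0, origin=(0.0, 0.0, 0.0))
uv, depth = project_points(centers, K, R, t)
visible = depth > 0                       # keep only voxels in front of the camera

density = np.zeros((2, 2, 3))
density[1, 0, 2] = 0.9                    # one occupied voxel above ground cell (1, 0)
bev = density_to_bev(density)
```

In this toy setup each voxel center gets a pixel location in the camera image, which is the kind of lookup a 2D-to-3D feature lifting step needs, and the occupied voxel shows up at ground cell (1, 0) of the BEV map regardless of its height.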


Cited By

  • (2025) Synergistic Fusion: Vision-Language Models in Advancing Autonomous Driving and Intelligent Transportation Systems. Building Embodied AI Systems: The Agents, the Architecture Principles, Challenges, and Application Domains. DOI: 10.1007/978-3-031-68256-8_9 (pp. 205-221). Online publication date: 19-Jan-2025

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multi-view pedestrian detection
  2. unsupervised learning

Qualifiers

  • Research-article

Conference

MM '24
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
