research-article

Reconstructing Close Human Interactions from Multiple Views

Authors:

Xiaowei Zhou

ACM Transactions on Graphics (TOG), Volume 42, Issue 6

Article No.: 273, Pages 1 - 14

https://doi.org/10.1145/3618336

Published: 05 December 2023

Abstract

This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras. The difficulty arises from the noisy or false 2D keypoint detections due to inter-person occlusion, the heavy ambiguity in associating keypoints to individuals due to the close interactions, and the scarcity of training data as collecting and annotating motion data in crowded scenes is resource-intensive. We introduce a novel system to address these challenges. Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies. The pose estimation component takes multi-view 2D keypoint heatmaps as input and reconstructs the pose of each individual using a 3D conditional volumetric network. As the network doesn't need images as input, we can leverage known camera parameters from test scenes and a large quantity of existing motion capture data to synthesize massive training data that mimics the real data distribution in test scenes. Extensive experiments demonstrate that our approach significantly surpasses previous approaches in terms of pose accuracy and is generalizable across various camera setups and population sizes. The code is available on our project page: https://github.com/zju3dv/CloseMoCap.
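The data-synthesis idea in the abstract (known camera parameters plus existing motion capture data yield synthetic 2D keypoint heatmaps) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a plain pinhole camera model and noise-free Gaussian heatmaps, and every name in it (`project_point`, `gaussian_heatmap`, `synthesize_heatmaps`, the toy camera and joints) is hypothetical. A realistic generator would presumably also need to imitate detector noise and inter-person occlusion, which this sketch omits.

```python
import math


def project_point(K, R, t, X):
    """Project a 3D world point X to pixel coordinates with a pinhole camera.

    K: 3x3 intrinsics, R: 3x3 rotation, t: length-3 translation (world -> camera).
    Returns (u, v), or None if the point lies behind the camera.
    """
    # Camera-frame coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        return None
    # Perspective division, then apply the intrinsics
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]
    u = K[0][0] * x + K[0][1] * y + K[0][2]
    v = K[1][1] * y + K[1][2]
    return (u, v)


def gaussian_heatmap(center, width, height, sigma=2.0):
    """Render a height x width Gaussian blob centered on (u, v) with peak value 1."""
    u0, v0 = center
    return [
        [math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2 * sigma ** 2))
         for u in range(width)]
        for v in range(height)
    ]


def synthesize_heatmaps(joints_3d, cameras, width, height, sigma=2.0):
    """Return heatmaps indexed as [camera][joint][row][col].

    Joints behind a camera or outside its image get an all-zero map.
    """
    all_views = []
    for K, R, t in cameras:
        view = []
        for X in joints_3d:
            uv = project_point(K, R, t, X)
            visible = (uv is not None
                       and 0 <= uv[0] < width and 0 <= uv[1] < height)
            if visible:
                view.append(gaussian_heatmap(uv, width, height, sigma))
            else:
                view.append([[0.0] * width for _ in range(height)])
        all_views.append(view)
    return all_views


# Toy setup (assumed values): one camera at the world origin looking down +z,
# and two "mocap" joints placed two meters in front of it.
K = [[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
joints = [[0.0, 0.0, 2.0], [0.1, -0.1, 2.0]]
heatmaps = synthesize_heatmaps(joints, [(K, R, t)], width=64, height=48)
```

Because the synthetic heatmaps depend only on camera parameters and 3D joint positions, no rendered images are needed, which is what makes it possible to reuse large motion capture collections as training data for an arbitrary test-scene camera layout.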



Cited By

Eliseev S, Shtanko L, Akhunzianov R, Romanenko Y, Starostin A. (2024). MV2MP: Segmentation Free Performance Capture of Humans in Direct Physical Contact from Sparse Multi-Cam Setups. Computer Vision – ACCV 2024, 71-87. Online publication date: 8-Dec-2024.
https://dl.acm.org/doi/10.1007/978-981-96-0969-7_5
Lu F, Dong Z, Song J, Hilliges O. (2024). AvatarPose: Avatar-Guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos. Computer Vision – ECCV 2024, 215-233. Online publication date: 29-Sep-2024.
https://dl.acm.org/doi/10.1007/978-3-031-73668-1_13

Index Terms

Reconstructing Close Human Interactions from Multiple Views
1. Computing methodologies
  1. Computer graphics
    1. Animation
      1. Motion capture


Published In

ACM Transactions on Graphics Volume 42, Issue 6

December 2023

1565 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/3632123

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
