
GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction

Published: 27 October 2023

Abstract

Demystifying complex human-ground interactions is essential for accurate and realistic 3D human motion reconstruction from RGB videos, as it ensures consistency between the humans and the ground plane. Prior methods have modeled human-ground interactions either implicitly or in a sparse manner, often resulting in unrealistic and incorrect motions when faced with noise and uncertainty. In contrast, our approach explicitly represents these interactions in a dense and continuous manner. To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence. It is trained to explicitly promote consistency between the motion and distance change towards the ground. After training, we establish a joint optimization strategy that utilizes GraMMaR as a dual-prior, regularizing the optimization towards the space of plausible ground-aware motions. This leads to realistic and coherent motion reconstruction, irrespective of the assumed or learned ground plane. Through extensive evaluation on the AMASS and AIST++ datasets, our model demonstrates good generalization and discriminating abilities in challenging cases including complex and ambiguous human-ground interactions. The code will be available at https://github.com/xymsh/GraMMaR.
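The consistency objective sketched in the abstract (a dense, per-joint ground-distance representation whose change must agree with the motion) can be illustrated with a minimal numerical sketch. The function names, array shapes, and the exact squared-error form below are assumptions for illustration only, not the paper's actual implementation: the idea is that for a static ground plane, each joint's change in signed ground distance between frames must equal the component of its displacement along the plane normal.

```python
import numpy as np

def signed_ground_distance(joints, n, d):
    """Signed distance of each joint to the plane {x : n.x + d = 0}.
    joints: (T, J, 3) joint positions; n: unit plane normal; d: offset."""
    return joints @ n + d                       # -> (T, J)

def ground_consistency_loss(delta_x, delta_dist, n):
    """Penalize disagreement between a predicted per-joint displacement
    delta_x (T-1, J, 3) and a predicted change in ground distance
    delta_dist (T-1, J): for a fixed plane the distance change must equal
    the displacement component along the normal n."""
    return float(np.mean((delta_dist - delta_x @ n) ** 2))

# Toy usage: a motion that is exactly consistent incurs (near-)zero loss.
T, J = 8, 22
n = np.array([0.0, 0.0, 1.0])                   # ground = z = 0 plane
joints = np.random.randn(T, J, 3)
dist = signed_ground_distance(joints, n, 0.0)
loss = ground_consistency_loss(np.diff(joints, axis=0),
                               np.diff(dist, axis=0), n)
```

In the paper the pose transition and the ground-interaction transition are learned jointly as a distribution; a penalty of this shape would then regularize the two predicted quantities toward mutual consistency rather than being computed from ground-truth geometry as above.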

Supplemental Material

MP4 File
The presentation video for "GraMMaR: Ground-aware Motion Model for 3D Human Motion Reconstruction".


Cited By

View all
  • (2024) Towards Variable and Coordinated Holistic Co-Speech Motion Generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1566-1576. https://doi.org/10.1109/CVPR52733.2024.00155. Online publication date: 16-Jun-2024.


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D human motion
  2. motion reconstruction

Qualifiers

  • Research-article

Conference

MM '23
Sponsor: MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
