research-article
Open access

Decaf: Monocular Deformation Capture for Face and Hand Interactions

Published: 05 December 2023

Abstract

Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects (e.g., two hands or humans interacting with rigid environments). Modelling dense non-rigid object deformations in this setting (e.g., when hands are interacting with a face) has remained largely unaddressed so far, although such effects can improve the realism of downstream applications such as AR/VR, 3D virtual avatar communications, and character animations. This is due to the severe ill-posedness of the monocular view setting and the associated challenges (e.g., acquiring a dataset for training and evaluation, or obtaining reasonable non-uniform stiffness of the deformable object). While it is possible to naïvely track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates, such as depth ambiguity, unnatural intra-object collisions, and missing or implausible deformations.
Hence, this paper introduces the first method that addresses the fundamental challenges described above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible than those of several baselines applicable in our setting, both quantitatively and qualitatively. Project page: https://vcai.mpi-inf.mpg.de/projects/Decaf
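
The abstract names position-based dynamics with non-uniform stiffness but gives no solver details. As a rough illustration only, the generic position-based dynamics distance-constraint projection can be sketched as below. This is not the authors' implementation: the function name, the per-constraint `stiffness` parameter (standing in for non-uniform tissue stiffness), and the inverse-mass weights `w1`, `w2` are all assumptions made for this example.

```python
import math


def project_distance_constraint(p1, p2, w1, w2, rest_length, stiffness):
    """Hypothetical helper: one PBD projection step for a distance constraint.

    p1, p2      -- particle positions as lists of floats
    w1, w2      -- inverse masses (0 means the particle is fixed)
    rest_length -- target distance between the two particles
    stiffness   -- value in [0, 1]; lower values correct less per step
    """
    delta = [a - b for a, b in zip(p1, p2)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist == 0.0 or w1 + w2 == 0.0:
        return list(p1), list(p2)  # degenerate case: nothing to correct
    # Signed constraint violation, split between particles by inverse mass.
    scale = stiffness * (dist - rest_length) / (dist * (w1 + w2))
    new_p1 = [a - w1 * scale * d for a, d in zip(p1, delta)]
    new_p2 = [b + w2 * scale * d for b, d in zip(p2, delta)]
    return new_p1, new_p2


# Two particles stretched to twice their rest distance.
stiff = project_distance_constraint([0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                                    1.0, 1.0, 1.0, stiffness=1.0)
soft = project_distance_constraint([0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                                   1.0, 1.0, 1.0, stiffness=0.5)
```

With full stiffness the stretched pair is pulled back to its rest length in a single step, while the softer constraint removes only half of the violation. Assigning different stiffness values to different constraints is one simple way to make some regions deform more readily than others.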

Cited By

  • (2024) GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations. ACM Transactions on Graphics 43:6 (1-12). DOI: 10.1145/3687927. Online publication date: 19-Nov-2024
  • (2024) Recent Trends in 3D Reconstruction of General Non‐Rigid Scenes. Computer Graphics Forum 43:2. DOI: 10.1111/cgf.15062. Online publication date: 30-Apr-2024
  • (2024) 3D Pose Estimation of Two Interacting Hands from a Monocular Event Camera. 2024 International Conference on 3D Vision (3DV) (291-301). DOI: 10.1109/3DV62453.2024.00008. Online publication date: 18-Mar-2024

    Published In

    ACM Transactions on Graphics  Volume 42, Issue 6
    December 2023
    1565 pages
    ISSN:0730-0301
    EISSN:1557-7368
    DOI:10.1145/3632123
    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deformation
    2. interaction
    3. monocular
    4. motion capture

    Qualifiers

    • Research-article

    Funding Sources

    • ERC Consolidator Grant 4DRepLy
