Expressive Whole-Body 3D Gaussian Avatar

Published: 17 November 2024

Abstract

Facial expressions and hand motions are necessary to express our emotions and interact with the world. Nevertheless, most 3D human avatars modeled from a casually captured video support only body motions, without facial expressions or hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model SMPL-X and 3D Gaussian Splatting (3DGS). The main challenges are 1) the limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity makes animation with novel facial expressions and poses non-trivial, and the absence of 3D observations causes significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these challenges, we introduce a hybrid representation of the mesh and 3D Gaussians: each 3D Gaussian is treated as a vertex on the surface, with pre-defined connectivity information (i.e., triangle faces) between Gaussians following the mesh topology of SMPL-X. This makes ExAvatar animatable with novel facial expressions, driven by the facial expression space of SMPL-X. In addition, connectivity-based regularizers significantly reduce artifacts under novel facial expressions and poses.
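
To make the connectivity-based regularization concrete, the sketch below shows one plausible form of such a term in PyTorch: a smoothness loss that penalizes differences of per-Gaussian attributes between Gaussians joined by an SMPL-X mesh edge. This is a minimal illustration under our own assumptions; the function names (faces_to_edges, connectivity_regularizer) are hypothetical, and the actual ExAvatar losses may differ.

```python
import torch

def faces_to_edges(faces: torch.Tensor) -> torch.Tensor:
    # faces: (F, 3) long tensor of triangle vertex indices (e.g. SMPL-X faces).
    # Each triangle (a, b, c) contributes undirected edges (a,b), (b,c), (c,a).
    e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    e = torch.sort(e, dim=1).values   # canonical (min, max) vertex order
    return torch.unique(e, dim=0)     # (E, 2) unique undirected edges

def connectivity_regularizer(attrs: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    # attrs: (V, C) per-Gaussian parameters (e.g. position offsets or scales),
    #        one row per mesh vertex, since each Gaussian sits on a vertex.
    # edges: (E, 2) output of faces_to_edges.
    diff = attrs[edges[:, 0]] - attrs[edges[:, 1]]
    return diff.pow(2).sum(dim=-1).mean()  # mean squared difference across edges
```

In training, a term like this would be added, with a weight, to the photometric rendering loss, so that body parts never observed in the video (e.g., the back of the head) inherit plausible Gaussian parameters from their connected, observed neighbors.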



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLI
September 2024, 585 pages
ISBN: 978-3-031-72939-3
DOI: 10.1007/978-3-031-72940-9
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag, Berlin, Heidelberg
