DOI: 10.1145/3588432.3591490
Research Article · Open Access

PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling

Published: 23 July 2023
  Abstract

    Creating pose-driven human avatars is about modeling the mapping from the low-frequency driving pose to high-frequency dynamic human appearances, so an effective pose encoding method that can encode high-fidelity human details is essential to human avatar modeling. To this end, we present PoseVocab, a novel pose encoding method that encourages the network to discover the optimal pose embeddings for learning the dynamic human appearance. Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses. To achieve pose generalization and temporal consistency, we sample key rotations in so(3) of each joint rather than the global pose vectors, and assign a pose embedding to each sampled key rotation. These joint-structured pose embeddings not only encode the dynamic appearances under different key poses, but also factorize the global pose embedding into joint-structured ones to better learn the appearance variation related to the motion of each joint. To improve the representation ability of the pose embedding while maintaining memory efficiency, we introduce feature lines, a compact yet effective 3D representation, to model more fine-grained details of human appearances. Furthermore, given a query pose and a spatial position, a hierarchical query strategy is introduced to interpolate pose embeddings and acquire the conditional pose feature for dynamic human synthesis. Overall, PoseVocab effectively encodes the dynamic details of human appearance and enables realistic and generalized animation under novel poses. Experiments show that our method outperforms other state-of-the-art baselines both qualitatively and quantitatively in terms of synthesis quality. Code is available at https://github.com/lizhe00/PoseVocab.
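
    To make the encoding concrete, below is a rough PyTorch-style sketch of the core idea: each joint stores a set of key rotations (axis-angle vectors) with one learnable embedding per key, and a query pose retrieves a per-joint feature by interpolating the embeddings of its nearest key rotations. This is a hypothetical illustration, not the authors' implementation; the class and parameter names (PoseVocabSketch, keys_per_joint, etc.) are made up, a plain L2 distance on axis-angle vectors stands in for a proper rotation metric, and the feature-line representation and hierarchical spatial query from the paper are omitted.

        import torch
        import torch.nn as nn

        class PoseVocabSketch(nn.Module):
            """Illustrative joint-structured pose embedding lookup (not the official code)."""

            def __init__(self, num_joints: int, keys_per_joint: int, embed_dim: int):
                super().__init__()
                # Key rotations per joint as axis-angle vectors in so(3);
                # in the paper these are sampled from the training poses.
                self.key_rotations = nn.Parameter(
                    torch.randn(num_joints, keys_per_joint, 3), requires_grad=False
                )
                # One learnable embedding per sampled key rotation.
                self.embeddings = nn.Parameter(
                    torch.zeros(num_joints, keys_per_joint, embed_dim)
                )

            def forward(self, pose: torch.Tensor, k: int = 4) -> torch.Tensor:
                """pose: (num_joints, 3) axis-angle query pose.
                Returns (num_joints, embed_dim) features by blending the
                embeddings of each joint's k nearest key rotations."""
                # Distance from the query rotation to every key rotation.
                dist = torch.norm(self.key_rotations - pose.unsqueeze(1), dim=-1)
                knn_dist, knn_idx = dist.topk(k, largest=False)
                # Inverse-distance weights, normalized per joint.
                weights = 1.0 / (knn_dist + 1e-6)
                weights = weights / weights.sum(dim=-1, keepdim=True)
                knn_embed = torch.gather(
                    self.embeddings, 1,
                    knn_idx.unsqueeze(-1).expand(-1, -1, self.embeddings.shape[-1]),
                )
                return (weights.unsqueeze(-1) * knn_embed).sum(dim=1)

        # Usage: blend per-joint embeddings for a driving pose; the result,
        # together with a spatial position, would condition an appearance network.
        vocab = PoseVocabSketch(num_joints=23, keys_per_joint=64, embed_dim=32)
        pose_feature = vocab(torch.randn(23, 3))  # shape (23, 32)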

    Supplemental Material

    • MP4 file: presentation
    • ZIP file: supplementary document and video




      Published In

      SIGGRAPH '23: ACM SIGGRAPH 2023 Conference Proceedings
      July 2023
      911 pages
      ISBN:9798400701597
      DOI:10.1145/3588432
      This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 23 July 2023


      Author Tags

      1. Animatable avatar
      2. human modeling
      3. human synthesis

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • NSFC
      • National Key R&D Program of China

      Conference

      SIGGRAPH '23

      Acceptance Rates

      Overall acceptance rate: 1,822 of 8,601 submissions (21%)


      Bibliometrics & Citations

      Article Metrics

      • Downloads (last 12 months): 525
      • Downloads (last 6 weeks): 49
      Reflects downloads up to 27 Jul 2024


      Citations

      Cited By
      • DreamHuman. Proceedings of the 37th International Conference on Neural Information Processing Systems (2023), 10516–10529. Online publication date: 10 Dec 2023. DOI: 10.5555/3666122.3666584
      • Leveraging Intrinsic Properties for Non-Rigid Garment Alignment. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 14439–14450. Online publication date: 1 Oct 2023. DOI: 10.1109/ICCV51070.2023.01332
      • CaPhy: Capturing Physical Properties for Animatable Human Avatars. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 14104–14114. Online publication date: 1 Oct 2023. DOI: 10.1109/ICCV51070.2023.01301
