Abstract
We introduce a novel framework for 3D human avatar generation and personalization that leverages text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges of photo-realistic avatar synthesis. First, we utilize a conditional Neural Radiance Field (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Second, we develop a geometric prior, leveraging the capabilities of text-to-image diffusion models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by an optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As our extensive experiments show, these strategies collectively enable the creation of custom avatars with unparalleled visual quality and better adherence to input text prompts. More results and videos can be found on our project website: syntec-research.github.io/MagicMirror.
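To make the optimization step concrete, the following is a brief recap of the Variational Score Distillation objective introduced in ProlificDreamer (Wang et al., NeurIPS 2023) that our pipeline builds on; it is a generic sketch of VSD, not a description of our exact implementation. Let g(θ, c) denote a differentiable rendering of the avatar NeRF with parameters θ from camera c, let x_t be that rendering noised to diffusion timestep t, let y be the text prompt, let ε_{φ*} be the frozen pretrained text-to-image diffusion model, and let ε_φ be a LoRA-fine-tuned copy that estimates the score of the distribution of current renderings. The NeRF parameters are then updated with

\[
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}(\theta) \;=\; \mathbb{E}_{t,\epsilon,c}\!\left[\,\omega(t)\,\big(\epsilon_{\phi^*}(x_t;\, y, t) \;-\; \epsilon_{\phi}(x_t;\, y, c, t)\big)\,\frac{\partial g(\theta, c)}{\partial \theta}\right],
\]

where ω(t) is a timestep-dependent weight. Because the second term is a learned score of the renders themselves rather than the raw injected noise used by Score Distillation Sampling, VSD can operate at much lower classifier-free-guidance weights, which is what mitigates the over-saturation and texture loss mentioned above.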
Acknowledgements
We would like to thank Prof. Octavia Camps for her support, and we gratefully acknowledge ONR grant N00014-21-1-2431 from NCI.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Comas-Massagué, A. et al. (2025). MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_11
DOI: https://doi.org/10.1007/978-3-031-72848-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72847-1
Online ISBN: 978-3-031-72848-8