Abstract
We introduce a novel framework for 3D human avatar generation and personalization that leverages text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges of photo-realistic avatar synthesis. First, we utilize a conditional Neural Radiance Field (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Second, we develop a geometric prior, leveraging the capabilities of text-to-image diffusion models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by an optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As our extensive experiments show, these strategies collectively enable the creation of custom avatars with unparalleled visual quality and better adherence to input text prompts. More results and videos can be found on our project website: syntec-research.github.io/MagicMirror.
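To make the optimization step concrete, the following is a brief recap of the Variational Score Distillation objective introduced in ProlificDreamer (Wang et al., NeurIPS 2023) that our pipeline builds on; it is a generic sketch of VSD, not a description of our exact implementation. Let g(θ, c) denote a differentiable rendering of the avatar NeRF with parameters θ from camera c, let x_t be that rendering noised to diffusion timestep t, let y be the text prompt, let ε_{φ*} be the frozen pretrained text-to-image diffusion model, and let ε_φ be a LoRA-fine-tuned copy that estimates the score of the distribution of current renderings. The NeRF parameters are then updated with

\[
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}(\theta) \;=\; \mathbb{E}_{t,\epsilon,c}\!\left[\,\omega(t)\,\big(\epsilon_{\phi^*}(x_t;\, y, t) \;-\; \epsilon_{\phi}(x_t;\, y, c, t)\big)\,\frac{\partial g(\theta, c)}{\partial \theta}\right],
\]

where ω(t) is a timestep-dependent weight. Because the second term is a learned score of the renders themselves rather than the raw injected noise used by Score Distillation Sampling, VSD can operate at much lower classifier-free-guidance weights, which is what mitigates the over-saturation and texture loss mentioned above.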
Acknowledgements
We would like to thank Prof. Octavia Camps for her support, and we gratefully acknowledge ONR grant N00014-21-1-2431 from NCI.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Comas-Massagué, A. et al. (2025). MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15124. Springer, Cham. https://doi.org/10.1007/978-3-031-72848-8_11
DOI: https://doi.org/10.1007/978-3-031-72848-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72847-1
Online ISBN: 978-3-031-72848-8