
DreamReward: Text-to-3D Generation with Human Preference

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15128)


Abstract

3D content creation from text prompts has recently shown remarkable success. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline that includes rating and ranking. We then build Reward3D, the first general-purpose text-to-3D human preference reward model, to effectively encode human preferences. Building upon this 3D reward model, we finally perform theoretical analysis and present Reward3D Feedback Learning (DreamFL), a direct tuning algorithm that optimizes multi-view diffusion models with a redefined scorer. Grounded in theoretical proof and extensive experimental comparisons, our DreamReward successfully generates high-fidelity and 3D-consistent results with significant gains in prompt alignment with human intention. Our results demonstrate the great potential of learning from human feedback for improving text-to-3D models. Project Page: https://jamesyjl.github.io/DreamReward/.

J. Ye and F. Liu—Equal contribution.
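For readers unfamiliar with preference-reward training, the sketch below illustrates the kind of pairwise objective a reward model such as Reward3D could be trained with on expert comparisons. It is a minimal sketch assuming a Bradley-Terry-style objective over rendered multi-views; the class and function names (`Reward3DSketch`, `preference_loss`), the projection-head architecture, and the encoder stand-ins are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch: pairwise (Bradley-Terry) preference training for a
# text-conditioned 3D reward model. All names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reward3DSketch(nn.Module):
    """Scores a 3D asset, given as V rendered views, against a text prompt."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Stand-ins for features produced by frozen CLIP-style encoders.
        self.text_proj = nn.Linear(dim, dim)
        self.view_proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, text_emb: torch.Tensor, view_embs: torch.Tensor):
        # text_emb: (B, dim); view_embs: (B, V, dim) for V rendered views.
        t = self.text_proj(text_emb).unsqueeze(1)   # (B, 1, dim)
        v = self.view_proj(view_embs)               # (B, V, dim)
        per_view = self.head(t * v).squeeze(-1)     # (B, V) per-view scores
        return per_view.mean(dim=1)                 # one scalar per asset

def preference_loss(model, text_emb, views_win, views_lose):
    """Bradley-Terry loss: the expert-preferred asset should score higher."""
    margin = model(text_emb, views_win) - model(text_emb, views_lose)
    return -F.logsigmoid(margin).mean()

# Smoke test on random features.
model = Reward3DSketch()
text = torch.randn(4, 512)
win, lose = torch.randn(4, 6, 512), torch.randn(4, 6, 512)
preference_loss(model, text, win, lose).backward()
```

Once such a reward model is trained, a DreamFL-style tuning stage could, for example, add a reward-ascent term on clean renders to the score-distillation objective used to optimize the multi-view diffusion model; the paper's exact "redefined scorer" formulation is given in the full text.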




Author information

Corresponding author

Correspondence to Yueqi Duan.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 2828 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ye, J. et al. (2025). DreamReward: Text-to-3D Generation with Human Preference. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72897-6_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72896-9

  • Online ISBN: 978-3-031-72897-6

  • eBook Packages: Computer Science, Computer Science (R0)
