
DreamReward: Text-to-3D Generation with Human Preference

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15128)


Abstract

3D content creation from text prompts has recently shown remarkable success. However, current text-to-3D methods often generate 3D results that do not align well with human preferences. In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. To begin with, we collect 25k expert comparisons based on a systematic annotation pipeline that includes rating and ranking. We then build Reward3D, the first general-purpose text-to-3D human preference reward model, to effectively encode human preferences. Building upon this 3D reward model, we finally perform theoretical analysis and present Reward3D Feedback Learning (DreamFL), a direct tuning algorithm that optimizes multi-view diffusion models with a redefined scorer. Grounded in theoretical proof and extensive experimental comparisons, our DreamReward successfully generates high-fidelity and 3D-consistent results with significant gains in prompt alignment with human intention. Our results demonstrate the great potential of learning from human feedback for improving text-to-3D models. Project Page: https://jamesyjl.github.io/DreamReward/.

J. Ye and F. Liu—Equal contribution.
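For readers unfamiliar with preference-reward training, the sketch below illustrates the kind of pairwise objective a reward model such as Reward3D could be trained with on expert comparisons. It is a minimal sketch assuming a Bradley-Terry-style objective over rendered multi-views; the class and function names (`Reward3DSketch`, `preference_loss`), the projection-head architecture, and the encoder stand-ins are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch: pairwise (Bradley-Terry) preference training for a
# text-conditioned 3D reward model. All names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reward3DSketch(nn.Module):
    """Scores a 3D asset, given as V rendered views, against a text prompt."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Stand-ins for features produced by frozen CLIP-style encoders.
        self.text_proj = nn.Linear(dim, dim)
        self.view_proj = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, text_emb: torch.Tensor, view_embs: torch.Tensor):
        # text_emb: (B, dim); view_embs: (B, V, dim) for V rendered views.
        t = self.text_proj(text_emb).unsqueeze(1)   # (B, 1, dim)
        v = self.view_proj(view_embs)               # (B, V, dim)
        per_view = self.head(t * v).squeeze(-1)     # (B, V) per-view scores
        return per_view.mean(dim=1)                 # one scalar per asset

def preference_loss(model, text_emb, views_win, views_lose):
    """Bradley-Terry loss: the expert-preferred asset should score higher."""
    margin = model(text_emb, views_win) - model(text_emb, views_lose)
    return -F.logsigmoid(margin).mean()

# Smoke test on random features.
model = Reward3DSketch()
text = torch.randn(4, 512)
win, lose = torch.randn(4, 6, 512), torch.randn(4, 6, 512)
preference_loss(model, text, win, lose).backward()
```

Once such a reward model is trained, a DreamFL-style tuning stage could, for example, add a reward-ascent term on clean renders to the score-distillation objective used to optimize the multi-view diffusion model; the paper's exact "redefined scorer" formulation is given in the full text.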




Author information

Corresponding author

Correspondence to Yueqi Duan.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 2828 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ye, J. et al. (2025). DreamReward: Text-to-3D Generation with Human Preference. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72897-6_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72896-9

  • Online ISBN: 978-3-031-72897-6

  • eBook Packages: Computer Science, Computer Science (R0)
