Abstract
3D content creation from text prompts has recently shown remarkable success. However, current text-to-3D methods often generate results that do not align well with human preferences. In this paper, we present DreamReward, a comprehensive framework for learning and improving text-to-3D models from human preference feedback. We first collect 25k expert comparisons through a systematic annotation pipeline that includes both rating and ranking. We then build Reward3D, the first general-purpose text-to-3D human preference reward model, to effectively encode human preferences. Building on this 3D reward model, we perform theoretical analysis and present Reward3D Feedback Learning (DreamFL), a direct tuning algorithm that optimizes multi-view diffusion models with a redefined scorer. Grounded in theoretical analysis and extensive experimental comparisons, DreamReward generates high-fidelity, 3D-consistent results with significant gains in alignment with human intention. Our results demonstrate the great potential of learning from human feedback for improving text-to-3D models. Project Page: https://jamesyjl.github.io/DreamReward/.
J. Ye and F. Liu—Equal contribution.
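To make the first stage of the pipeline concrete, below is a minimal PyTorch sketch of how a reward model can be trained from pairwise preference data such as the 25k expert comparisons described in the abstract. This is an illustrative assumption, not the Reward3D architecture or the authors' code: the module names, encoder, and dimensions (RewardModel, prompt_emb, views) are hypothetical, and only the standard Bradley-Terry ranking objective from preference learning is assumed.

```python
# Minimal sketch (not the authors' code): train a scalar reward model on
# pairwise human comparisons so that the preferred 3D result, represented
# here by its rendered multi-view images, scores higher than the rejected
# one. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a (prompt, rendered-views) pair with a single scalar."""

    def __init__(self, text_dim=512, view_dim=512):
        super().__init__()
        # Flatten each rendered view and project it; average across views.
        self.view_encoder = nn.Sequential(
            nn.Flatten(start_dim=2),  # (B, V, C, H, W) -> (B, V, C*H*W)
            nn.LazyLinear(view_dim),  # (B, V, view_dim)
        )
        self.head = nn.Linear(text_dim + view_dim, 1)

    def forward(self, prompt_emb, views):
        v = self.view_encoder(views).mean(dim=1)  # pool over views: (B, view_dim)
        return self.head(torch.cat([prompt_emb, v], dim=-1)).squeeze(-1)  # (B,)

def preference_loss(model, prompt_emb, views_win, views_lose):
    """Bradley-Terry loss: the human-preferred sample should outrank the other.

    P(win > lose) = sigmoid(r_win - r_lose); minimize its negative log-likelihood.
    """
    r_win = model(prompt_emb, views_win)
    r_lose = model(prompt_emb, views_lose)
    return -F.logsigmoid(r_win - r_lose).mean()
```

In the second stage, a differentiable reward of this form could in principle be backpropagated through rendered views to steer generation; the actual DreamFL algorithm instead redefines the scorer inside the multi-view diffusion guidance, which this sketch does not reproduce.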
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ye, J., et al. (2025). DreamReward: Text-to-3D Generation with Human Preference. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_15