ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent progress in personalizing text-to-image (T2I) diffusion models has demonstrated their capability to generate images based on personalized visual concepts using only a few user-provided examples. However, these models often struggle with maintaining high visual fidelity, particularly when modifying scenes according to textual descriptions. To address this challenge, we introduce ComFusion, an innovative approach that leverages pretrained models to create compositions of user-supplied subject images and predefined text scenes. ComFusion incorporates a class-scene prior preservation regularization, which combines subject class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse-generated images to ensure alignment with both the instance images and scene texts, thereby achieving a delicate balance between capturing the subject’s essence and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.
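The class-scene prior preservation idea described above can be read as a standard denoising loss on the user's instance images plus a second denoising loss on images that the frozen pretrained model generates for the subject class placed in predefined scenes. The sketch below is a minimal illustration under that reading only; the toy denoiser, the simplified forward process, and names such as `comfusion_style_loss` and `lambda_prior` are illustrative placeholders, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): a DreamBooth-style fine-tuning loss
# augmented with a class-scene prior-preservation term, as the abstract describes.
# The tiny MLP stands in for a text-conditioned diffusion denoiser; all names
# and shapes here are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDenoiser(nn.Module):
    """Placeholder for a text-conditioned noise-prediction network (e.g. a UNet)."""

    def __init__(self, latent_dim: int = 16, text_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, noisy_latent, t, text_emb):
        x = torch.cat([noisy_latent, text_emb, t.unsqueeze(-1)], dim=-1)
        return self.net(x)  # predicted noise


def comfusion_style_loss(denoiser, instance_latent, instance_text,
                         class_scene_latent, class_scene_text,
                         lambda_prior: float = 1.0):
    """Instance reconstruction loss plus a class-scene prior-preservation term.

    `class_scene_latent` would come from images generated by the frozen
    pretrained model for prompts combining the subject class with predefined
    scene texts, so the fine-tuned model retains class and scene knowledge.
    """
    def denoising_loss(latent, text_emb):
        noise = torch.randn_like(latent)
        t = torch.rand(latent.shape[0])           # continuous stand-in for a timestep
        noisy = latent + t.unsqueeze(-1) * noise  # simplified forward process
        return F.mse_loss(denoiser(noisy, t, text_emb), noise)

    loss_instance = denoising_loss(instance_latent, instance_text)
    loss_prior = denoising_loss(class_scene_latent, class_scene_text)
    return loss_instance + lambda_prior * loss_prior


# Usage with random tensors in place of real latents and text embeddings.
denoiser = ToyDenoiser()
loss = comfusion_style_loss(denoiser,
                            torch.randn(4, 16), torch.randn(4, 8),
                            torch.randn(4, 16), torch.randn(4, 8))
loss.backward()
```

In practice the placeholder network would be a pretrained latent-diffusion UNet, the latents and text embeddings would come from the model's VAE and text encoder, and the fusion stage described in the abstract would additionally use coarse-generated images to balance subject and scene fidelity; those details are not reproduced here.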

Author information

Corresponding author

Correspondence to Jianfu Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 29489 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hong, Y. et al. (2025). ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15102. Springer, Cham. https://doi.org/10.1007/978-3-031-72784-9_1

  • DOI: https://doi.org/10.1007/978-3-031-72784-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72783-2

  • Online ISBN: 978-3-031-72784-9

  • eBook Packages: Computer Science, Computer Science (R0)
