Abstract
Recent progress in personalizing text-to-image (T2I) diffusion models has shown that they can generate images of personalized visual concepts from only a few user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when the scene is modified according to a textual description. To address this challenge, we introduce ComFusion, a novel approach that leverages pretrained models to compose user-supplied subject images with predefined text scenes. ComFusion incorporates a class-scene prior preservation regularization, which combines subject-class and scene-specific knowledge from pretrained models to enhance generation fidelity. In addition, ComFusion uses coarse-generated images to ensure alignment with both the instance images and the scene texts, striking a careful balance between capturing the subject's essence and preserving scene fidelity. Extensive evaluations against a range of T2I personalization baselines demonstrate ComFusion's qualitative and quantitative superiority.
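The abstract mentions two training signals beyond the usual instance reconstruction: a class-scene prior-preservation term built from images that the frozen pretrained model generates for "class in scene" prompts, and a fusion term that keeps the fine-tuned model's outputs close to coarse generations so the scene text is respected. The PyTorch sketch below shows one plausible way such a composite objective could be assembled; the tensor names, loss weights, and the use of plain MSE terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_personalization_loss(
    eps_pred_instance: torch.Tensor,     # noise predicted on subject (instance) images
    eps_true_instance: torch.Tensor,     # noise actually added to those latents
    eps_pred_class_scene: torch.Tensor,  # noise predicted on "class in scene" prior images
    eps_true_class_scene: torch.Tensor,  # noise added to the prior-image latents
    pred_fine_latents: torch.Tensor,     # latents predicted by the fine-tuned model
    coarse_latents: torch.Tensor,        # latents of coarse generations aligned with the scene text
    w_prior: float = 1.0,                # weight of the class-scene prior-preservation term (assumed)
    w_fusion: float = 0.5,               # weight of the fusion term (assumed)
) -> torch.Tensor:
    """Hypothetical composite objective in the spirit of the abstract:
    instance reconstruction + class-scene prior preservation + fusion with
    coarse generations. Loss forms and weights are assumptions."""
    # Standard denoising loss on the few user-provided subject images.
    instance_loss = F.mse_loss(eps_pred_instance, eps_true_instance)

    # Prior-preservation loss on images sampled from the frozen pretrained model
    # for class+scene prompts, retaining both class and scene knowledge.
    prior_loss = F.mse_loss(eps_pred_class_scene, eps_true_class_scene)

    # Fusion loss pulling fine-tuned outputs toward coarse generations that
    # already match the scene description.
    fusion_loss = F.mse_loss(pred_fine_latents, coarse_latents)

    return instance_loss + w_prior * prior_loss + w_fusion * fusion_loss
```

At each training step the first two terms would be computed in the style of DreamBooth-like prior preservation, with the fusion term added on top; this is a reading of the abstract, not the authors' released implementation.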
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hong, Y. et al. (2025). ComFusion: Enhancing Personalized Generation by Instance-Scene Compositing and Fusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15102. Springer, Cham. https://doi.org/10.1007/978-3-031-72784-9_1
DOI: https://doi.org/10.1007/978-3-031-72784-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72783-2
Online ISBN: 978-3-031-72784-9
eBook Packages: Computer Science, Computer Science (R0)