Abstract
Despite significant advancements in image customization with diffusion models, current methods still have several limitations: 1) unintended changes in non-target areas when regenerating the entire image; 2) guidance by either a reference image or a text description alone; and 3) time-consuming fine-tuning, which limits their practical application. In response, we introduce a tuning-free framework for simultaneous text-image-guided image customization, enabling precise editing of specific image regions within seconds. Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions. To achieve this, we propose an attention blending strategy that blends self-attention features in the UNet decoder during the denoising process. To our knowledge, this is the first tuning-free method that concurrently uses text and image guidance for image customization in specific regions. Our approach outperforms previous methods in both human and quantitative evaluations, providing an efficient solution for various practical applications, such as image synthesis, design, and creative photography. Project page: https://zrealli.github.io/TIGIC.
P. Li and Q. Nie—Equal contribution.
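To make the attention-blending idea from the abstract concrete, the sketch below illustrates one way such a step could look: self-attention features computed in a reference-image denoising branch are merged into the edit branch only inside the user-specified region, at a given UNet decoder layer and timestep. This is a minimal sketch under stated assumptions; the function name, tensor shapes, and masking scheme are illustrative and are not the authors' implementation.

```python
# Hypothetical sketch of region-restricted self-attention feature blending
# between two denoising branches of a diffusion UNet decoder.
import torch
import torch.nn.functional as F

def blend_self_attention(edit_feat: torch.Tensor,
                         ref_feat: torch.Tensor,
                         region_mask: torch.Tensor) -> torch.Tensor:
    """Blend per-token self-attention features inside the target region.

    edit_feat, ref_feat: (batch, tokens, channels) features from the two
        branches at the same decoder layer and denoising timestep.
    region_mask: (H, W) binary mask marking the region to customize.
    """
    b, n, c = edit_feat.shape
    side = int(n ** 0.5)  # assume a square latent token grid
    # Resize the mask to the token grid and flatten it to (1, tokens, 1).
    m = F.interpolate(region_mask[None, None].float(), size=(side, side),
                      mode="nearest").reshape(1, n, 1)
    # Inside the region keep the reference-branch features,
    # outside keep the text-guided edit branch untouched.
    return m * ref_feat + (1.0 - m) * edit_feat

if __name__ == "__main__":
    # Toy usage: 32x32 token grid, 320 channels, blend the upper-left quadrant.
    edit = torch.randn(1, 32 * 32, 320)
    ref = torch.randn(1, 32 * 32, 320)
    mask = torch.zeros(32, 32)
    mask[:16, :16] = 1.0
    print(blend_self_attention(edit, ref, mask).shape)  # torch.Size([1, 1024, 320])
```

In a full pipeline, a hook of this kind would presumably be applied only at selected decoder layers and over a subset of denoising steps, so that the global layout follows the text guidance while the masked region retains the reference subject's features.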
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 62122035).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, P. et al. (2025). Tuning-Free Image Customization with Image and Text Guidance. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_14
DOI: https://doi.org/10.1007/978-3-031-73116-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73115-0
Online ISBN: 978-3-031-73116-7
eBook Packages: Computer Science (R0)