UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image

Published: 26 July 2023

Abstract

Text-driven image generation methods have shown impressive results recently, allowing casual users to generate high-quality images by providing textual descriptions. However, similar capabilities for editing existing images are still out of reach. Text-driven image editing methods usually need edit masks, struggle with edits that require significant visual changes, and cannot easily preserve specific details of the edited portion. In this paper we make the observation that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image. We also show that initializing the stochastic sampler with a noised version of the base image before sampling, and interpolating relevant details from the base image after sampling, further increase the quality of the edit operation. Combining these observations, we propose UniTune, a novel image editing method. UniTune takes as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high fidelity to the input image. UniTune does not require additional inputs, such as masks or sketches, and can perform multiple edits on the same image without retraining. We test our method using the Imagen model in a range of different use cases. We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
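
To make the recipe concrete, the sketch below renders the three steps named in the abstract (single-image fine-tuning, noised initialization of the sampler, and post-sampling detail interpolation) as hypothetical Python. The DiffusionModel interface, the rare-token conditioning string, and all hyperparameter values are illustrative assumptions, not the authors' Imagen-based implementation.

    # Hypothetical sketch of the UniTune recipe described in the abstract.
    # DiffusionModel and its methods are illustrative stand-ins, not a real
    # API; the rare-token string and all hyperparameters are assumptions.
    import torch

    def unitune_edit(model, base_image: torch.Tensor, edit_prompt: str,
                     rare_token: str = "[V]", finetune_steps: int = 128,
                     noise_level: float = 0.7, detail_blend: float = 0.15):
        # 1) Fine-tune the text-to-image model on the single base image,
        #    conditioned on a rare token, so it learns the image while
        #    retaining its general generative ability.
        for _ in range(finetune_steps):
            model.train_step(image=base_image, text=rare_token)

        # 2) Initialize the stochastic sampler with a noised version of
        #    the base image rather than pure noise, then denoise while
        #    conditioning on the rare token plus the edit description.
        x_init = model.add_noise(base_image, t=noise_level)
        edited = model.sample(init=x_init, start_t=noise_level,
                              text=f"{rare_token} {edit_prompt}")

        # 3) Interpolate relevant details from the base image back into
        #    the result; a plain linear blend stands in here for whatever
        #    detail-transfer operator the paper actually uses.
        return detail_blend * base_image + (1 - detail_blend) * edited

In a pipeline of this shape, noise_level governs the fidelity/expressiveness trade-off: light noising keeps the sampler close to the base image, while heavier noising frees it to make the larger visual changes an edit may demand.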

Supplementary Material

• ZIP File (papers_806-supplemental.zip): supplemental material
• MP4 File (papers_806_VOD.mp4): presentation



      Published In

ACM Transactions on Graphics, Volume 42, Issue 4
August 2023, 1912 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3609020
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 26 July 2023
      Published in TOG Volume 42, Issue 4


      Author Tags

      1. image editing
      2. diffusion models

      Qualifiers

      • Research-article

      Article Metrics

• Downloads (last 12 months): 363
• Downloads (last 6 weeks): 18
Reflects downloads up to 03 Sep 2024

Cited By

• (2024) Training-Free Layout Control with Cross-Attention Guidance. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5331-5341. DOI: 10.1109/WACV57701.2024.00526. Online publication date: 3-Jan-2024.
• (2024) Shape-Guided Diffusion with Inside-Outside Attention. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4186-4195. DOI: 10.1109/WACV57701.2024.00415. Online publication date: 3-Jan-2024.
• (2024) MMGInpainting: Multi-Modality Guided Image Inpainting Based on Diffusion Models. IEEE Transactions on Multimedia, 26, 8811-8823. DOI: 10.1109/TMM.2024.3382484. Online publication date: 2024.
• (2024) Data Redaction from Conditional Generative Models. 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 569-591. DOI: 10.1109/SaTML59370.2024.00035. Online publication date: 9-Apr-2024.
• (2024) Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models. 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), 1-9. DOI: 10.1109/FG59268.2024.10581912. Online publication date: 27-May-2024.
• (2024) ConIS: controllable text-driven image stylization with semantic intensity. Multimedia Systems, 30(4). DOI: 10.1007/s00530-024-01381-1. Online publication date: 13-Jun-2024.
• (2023) ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models. ACM Transactions on Graphics, 42(6), 1-14. DOI: 10.1145/3618342. Online publication date: 5-Dec-2023.
• (2023) StableVideo: Text-driven Consistency-aware Diffusion Video Editing. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 22983-22993. DOI: 10.1109/ICCV51070.2023.02106. Online publication date: 1-Oct-2023.
• (2023) The Stable Signature: Rooting Watermarks in Latent Diffusion Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 22409-22420. DOI: 10.1109/ICCV51070.2023.02053. Online publication date: 1-Oct-2023.
• (2023) Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 7396-7406. DOI: 10.1109/ICCV51070.2023.00683. Online publication date: 1-Oct-2023.
