DOI: 10.1145/3652583.3658020
Research article · Open access

SBCR: Stochasticity Beats Content Restriction Problem in Training and Tuning Free Image Editing

Published: 07 June 2024

Abstract

Text-conditional image editing is a practical AIGC task that has recently emerged with great commercial and academic value. For real image editing, most diffusion model-based methods use DDIM Inversion as a first stage before editing. However, DDIM Inversion often fails to reconstruct the original image, which degrades downstream editing. Many inversion-based works modify the inversion formula to address this reconstruction failure, but doing so introduces a second issue: the content restriction problem. To solve the content restriction problem, we first analyze why reconstruction via DDIM Inversion fails, and then propose Reconstruction-and-Generation Balancing Noises (R&G-B noises) that achieve superior reconstruction and editing performance with the following advantages: 1) they can perfectly reconstruct real images without fine-tuning; 2) they can overcome the content restriction problem and generate diverse content.
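
For context, the sketch below illustrates the standard DDIM Inversion loop that the abstract identifies as the usual first stage of real-image editing. It is a minimal PyTorch-style sketch under stated assumptions: the noise predictor eps_model, the cumulative noise schedule alpha_bar, and prompt_emb are hypothetical placeholders, and the paper's R&G-B noises are not reproduced here.

import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alpha_bar, prompt_emb):
    # Map a real image latent x0 to an approximate noise latent x_T by
    # running the deterministic DDIM update in reverse (t -> t+1).
    # alpha_bar is a 1-D tensor of cumulative alphas, decreasing with t.
    x = x0
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t, prompt_emb)  # predicted noise at step t
        # Clean-latent estimate implied by the current x and eps.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # One forward (noising) step: the algebraic inverse of the
        # deterministic DDIM sampling step.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

Replaying deterministic DDIM sampling from the returned latent should in theory reconstruct x0 exactly, but per-step approximation errors accumulate in practice; this accumulation is the reconstruction failure the abstract describes, and the modified-formula fixes it motivates are what introduce the content restriction problem.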


Cited By

  • PFB-Diff: Progressive Feature Blending diffusion for text-driven image editing. Neural Networks, vol. 181, article 106777 (Jan 2025). DOI: 10.1016/j.neunet.2024.106777



    Published In

    ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
    May 2024
    1379 pages
    ISBN: 9798400706196
    DOI: 10.1145/3652583
    This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. aigc
    2. diffusion model
    3. generative model
    4. real image editing
    5. text-to-image generation

    Qualifiers

    • Research-article

    Funding Sources

    • Shenzhen Science and Technology Innovation Commission

    Conference

    ICMR '24

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%

    Article Metrics

    • Downloads (Last 12 months): 141
    • Downloads (Last 6 weeks): 40
    Reflects downloads up to 15 Oct 2024

