SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing

Gu, Jing; Zhao, Nanxuan; Xiong, Wei; Liu, Qing; Zhang, Zhifei; Zhang, He; Zhang, Jianming; Jung, HyunJoon; Wang, Yilin; Wang, Xin Eric

doi:10.1007/978-3-031-73411-3_23

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15090))

Included in the following conference series:

European Conference on Computer Vision

Abstract

Effective editing of personal content plays a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. In this work, we introduce SwapAnything, a novel framework that can swap any object in an image with a personalized concept given by a reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than only the main subject, (2) more faithful preservation of context pixels, and (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping, which applies region control over latent feature maps and swaps masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation to seamlessly adapt the semantic concept to the original image in terms of target location, shape, style, and content during the image generation process. Extensive results from both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows precise and faithful swapping abilities across single-object, multi-object, partial-object, and cross-domain swapping tasks. SwapAnything also achieves strong performance on text-based swapping and on tasks beyond swapping, such as object insertion.
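The region-controlled swap of masked latent variables described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `swap_masked_latents`, the `(channels, height, width)` layout, the binary mask, and the random arrays standing in for latent feature maps are all assumptions made for illustration. The abstract does not say which variables are swapped or at which generation steps.

```python
import numpy as np


def swap_masked_latents(source_latent, concept_latent, mask):
    """Hypothetical helper: take concept values inside the mask and
    keep source values everywhere else (the preserved context)."""
    if source_latent.shape != concept_latent.shape:
        raise ValueError("latent feature maps must have the same shape")
    if mask.shape != source_latent.shape[-2:]:
        raise ValueError("mask must match the spatial size of the latents")
    # The (H, W) boolean mask broadcasts over the leading channel axis.
    return np.where(mask.astype(bool), concept_latent, source_latent)


# Stand-in data (assumption): random arrays in place of real latent maps.
rng = np.random.default_rng(0)
source = rng.normal(size=(4, 8, 8))    # latent of the image being edited
concept = rng.normal(size=(4, 8, 8))   # latent carrying the reference concept

# Region to swap: a square in the middle of the feature map.
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0

swapped = swap_masked_latents(source, concept, mask)
```

In this sketch, positions outside the mask are copied from the source unchanged, which mirrors the context-preservation goal stated above. The appearance adaptation step is not modeled here.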

J. Gu—This work was partly performed when the first author interned at Adobe.

Y. Wang and X. E. Wang—Equal advising.

Author information

Authors and Affiliations

University of California, Santa Cruz, Santa Cruz, USA
Jing Gu & Xin Eric Wang
Adobe, Mountain View, USA
Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung & Yilin Wang

Corresponding author

Correspondence to Jing Gu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4015 KB)

Cite this paper

Gu, J. et al. (2025). SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15090. Springer, Cham. https://doi.org/10.1007/978-3-031-73411-3_23

DOI: https://doi.org/10.1007/978-3-031-73411-3_23
Published: 23 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73410-6
Online ISBN: 978-3-031-73411-3
eBook Packages: Computer Science, Computer Science (R0)
