INSTASTYLE: Inversion Noise of a Stylized Image is Secretly a Style Adviser

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15109)

Abstract

Stylized text-to-image generation focuses on creating images from textual descriptions while adhering to a style specified by reference images. However, subtle style variations within different reference images can hinder the model from accurately learning the target style. In this paper, we propose InstaStyle, a novel approach that excels in generating high-fidelity stylized images with only a single reference image. Our approach is based on the finding that the inversion noise from a stylized reference image inherently carries the style signal, as evidenced by its non-zero signal-to-noise ratio. We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the “style” noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveyance of style during image inversion. To address this, we devise prompt refinement, which learns a style token with the assistance of human feedback. Qualitative and quantitative experimental results demonstrate that InstaStyle achieves superior performance compared to current benchmarks. Furthermore, our approach showcases its capability in the creative task of style combination with mixed inversion noise.
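
The procedure described in the abstract, DDIM-inverting a single stylized reference image to obtain its “style” noise and then denoising from that noise with a new content prompt, can be approximated with off-the-shelf components. Below is a minimal sketch assuming the Hugging Face diffusers library (see Notes) and a Stable Diffusion v1.5 checkpoint; the model id, file names, prompts, helper names, and the "[S*]" placeholder for the learned style token are illustrative, and the paper's prompt refinement and human-feedback steps are not reproduced. It is a sketch of the idea, not the authors' implementation.

# Minimal sketch (not the authors' code): invert a stylized reference image
# with DDIM to recover its "style" noise, then generate from that noise.
# Assumes: pip install torch diffusers transformers pillow numpy
import numpy as np
import torch
from PIL import Image
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: any Stable Diffusion checkpoint works similarly; the paper does not mandate this one.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

def encode_image(path):
    """Map the reference image into the VAE latent space."""
    img = Image.open(path).convert("RGB").resize((512, 512))
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
    x = x.permute(2, 0, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        latent = pipe.vae.encode(x).latent_dist.sample()
    return latent * pipe.vae.config.scaling_factor

def embed(prompt):
    """CLIP text embeddings used to condition the UNet."""
    ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]

@torch.no_grad()
def ddim_invert(latent, prompt, steps=50):
    """Run DDIM backwards so the clean reference latent becomes "style" noise.
    Classifier-free guidance is omitted during inversion for simplicity."""
    inverse = DDIMInverseScheduler.from_config(pipe.scheduler.config)
    inverse.set_timesteps(steps, device=device)
    cond = embed(prompt)
    for t in inverse.timesteps:
        eps = pipe.unet(latent, t, encoder_hidden_states=cond).sample
        latent = inverse.step(eps, t, latent).prev_sample
    return latent

# Usage with hypothetical file names and prompts; "[S*]" stands in for the
# learned style token from the paper's prompt refinement (a plain descriptive
# phrase is a rough substitute here).
style_noise = ddim_invert(encode_image("reference_style.png"),
                          "a house in [S*] style")
image = pipe("a cat in [S*] style", latents=style_noise,
             num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("stylized_cat.png")

Under the same assumptions, the style combination mentioned at the end of the abstract can be approximated by interpolating the inverted noises of two reference images before the final denoising pass.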

Notes

  1. https://github.com/aim-uofa/StyleDrop-PyTorch.

  2. https://github.com/huggingface/diffusers.


Acknowledgement

This research is sponsored by the National Natural Science Foundation of China (Grant Nos. 62306041, U21B2045, 62176025), the Beijing Nova Program (Grant Nos. Z211100002121106, 20230484488, 20230484276), the Youth Innovation Promotion Association CAS (Grant No. 2022132), and the Beijing Municipal Science & Technology Commission (Grant No. Z231100007423015).

Author information

Corresponding author

Correspondence to Peipei Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 17799 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cui, X., Li, Z., Li, P., Huang, H., Liu, X., He, Z. (2025). INSTASTYLE: Inversion Noise of a Stylized Image is Secretly a Style Adviser. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15109. Springer, Cham. https://doi.org/10.1007/978-3-031-72983-6_26

  • DOI: https://doi.org/10.1007/978-3-031-72983-6_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72982-9

  • Online ISBN: 978-3-031-72983-6

  • eBook Packages: Computer Science, Computer Science (R0)
