Abstract
Stylized text-to-image generation focuses on creating images from textual descriptions while adhering to a style specified by reference images. However, subtle style variations among different reference images can prevent a model from accurately learning the target style. In this paper, we propose InstaStyle, a novel approach that excels at generating high-fidelity stylized images from only a single reference image. Our approach is based on the finding that the inversion noise of a stylized reference image inherently carries the style signal, as evidenced by its non-zero signal-to-noise ratio. We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the “style” noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveyance of style during image inversion. To address this, we devise prompt refinement, which learns a style token with the assistance of human feedback. Qualitative and quantitative experiments demonstrate that InstaStyle achieves superior performance compared with current benchmarks. Furthermore, our approach showcases its capability in the creative task of style combination with mixed inversion noise.
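For intuition, the pipeline described above can be sketched in a few lines of PyTorch. What follows is a minimal, illustrative sketch, not the authors' implementation: eps_model(x, t) is a placeholder for a diffusion model's noise predictor (in practice a text-conditioned U-Net over VAE latents), alphas_cumprod is assumed to hold the scheduler's cumulative noise schedule, and mix_style_noise is a hypothetical slerp-based combiner, since the abstract states only that inversion noises are mixed.

import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    # Deterministic DDIM inversion: run the DDIM update in reverse,
    # mapping a (latent) reference image x0 to the noise x_T that
    # regenerates it. This x_T is the "style" noise used for sampling.
    T = len(alphas_cumprod)
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        # Clean-sample estimate implied by the current x and predicted noise.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Step forward in noise level (the reverse of a DDIM sampling step).
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

def mix_style_noise(noise_a, noise_b, w=0.5):
    # Hypothetical style combination via spherical interpolation (slerp)
    # of two inverted noises; the paper's exact mixing scheme may differ.
    a, b = noise_a.flatten(), noise_b.flatten()
    cos = torch.clamp((a @ b) / (a.norm() * b.norm()), -1.0, 1.0)
    omega = torch.acos(cos)
    if omega.abs() < 1e-4:  # nearly parallel: fall back to linear mixing
        return (1 - w) * noise_a + w * noise_b
    s = torch.sin(omega)
    return (torch.sin((1 - w) * omega) / s) * noise_a \
         + (torch.sin(w * omega) / s) * noise_b

Generation then runs standard DDIM sampling from the returned noise (optionally mixed with a second style's inverted noise) instead of from random Gaussian noise, conditioned on a new prompt that includes the learned style token.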
Acknowledgement
This research is sponsored by the National Natural Science Foundation of China (Grant Nos. 62306041, U21B2045, 62176025), the Beijing Nova Program (Grant Nos. Z211100002121106, 20230484488, 20230484276), the Youth Innovation Promotion Association CAS (Grant No. 2022132), and the Beijing Municipal Science & Technology Commission (Grant No. Z231100007423015).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cui, X., Li, Z., Li, P., Huang, H., Liu, X., He, Z. (2025). InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15109. Springer, Cham. https://doi.org/10.1007/978-3-031-72983-6_26
DOI: https://doi.org/10.1007/978-3-031-72983-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72982-9
Online ISBN: 978-3-031-72983-6
eBook Packages: Computer Science, Computer Science (R0)