Distilling Diffusion Models Into Conditional GANs

Published: 31 October 2024
DOI: 10.1007/978-3-031-73390-1_25

Abstract

We propose a method to distill a complex multi-step diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in the diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss, yielding an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models, SDXL-Turbo and SDXL-Lightning, on the COCO benchmark.
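To make the E-LatentLPIPS idea concrete, the sketch below illustrates the core mechanism under stated assumptions: a perceptual distance is computed directly on diffusion latents (no VAE decode to pixel space), averaged over an ensemble of random augmentations applied identically to both latents so the noise-to-image pairing is preserved. This is a minimal sketch, not the authors' released code; the callable feature_distance is a hypothetical stand-in for an LPIPS-style metric calibrated on latents.

    # Minimal sketch of an ensembled latent-space perceptual loss (assumes PyTorch).
    # `feature_distance` is a hypothetical LPIPS-style metric defined on latents.
    import torch

    def paired_augment(a: torch.Tensor, b: torch.Tensor):
        # Sample ONE random flip/rotation and apply it to BOTH latents,
        # so the regression pair stays aligned after augmentation.
        if torch.rand(()) < 0.5:
            a, b = torch.flip(a, dims=[-1]), torch.flip(b, dims=[-1])
        k = int(torch.randint(0, 4, ()))  # 0..3 quarter turns
        return torch.rot90(a, k, dims=[-2, -1]), torch.rot90(b, k, dims=[-2, -1])

    def e_latent_lpips(fake_latent, target_latent, feature_distance, n_aug=4):
        # Average the latent-space perceptual distance over an ensemble of
        # shared augmentations; decoding to pixels is never required.
        total = fake_latent.new_zeros(())
        for _ in range(n_aug):
            f, t = paired_augment(fake_latent, target_latent)
            total = total + feature_distance(f, t)
        return total / n_aug

In a training loop, fake_latent would be the one-step generator's output for a stored noise-prompt pair and target_latent the teacher's ODE-trajectory endpoint; the multi-scale adversarial and text-alignment terms described in the abstract would be added on top of this regression loss.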


Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXVIII
Springer-Verlag, Berlin, Heidelberg, September 2024
570 pages
ISBN: 978-3-031-73389-5
DOI: 10.1007/978-3-031-73390-1
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
