Distilling Diffusion Models Into Conditional GANs

Published: 31 October 2024
DOI: 10.1007/978-3-031-73390-1_25

Abstract

We propose a method to distill a complex multi-step diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in the diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss, yielding an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models, SDXL-Turbo and SDXL-Lightning, on the COCO benchmark.
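To make the E-LatentLPIPS idea concrete, the sketch below illustrates the core mechanism under stated assumptions: a perceptual distance is computed directly on diffusion latents (no VAE decode to pixel space), averaged over an ensemble of random augmentations applied identically to both latents so the noise-to-image pairing is preserved. This is a minimal sketch, not the authors' released code; the callable feature_distance is a hypothetical stand-in for an LPIPS-style metric calibrated on latents.

    # Minimal sketch of an ensembled latent-space perceptual loss (assumes PyTorch).
    # `feature_distance` is a hypothetical LPIPS-style metric defined on latents.
    import torch

    def paired_augment(a: torch.Tensor, b: torch.Tensor):
        # Sample ONE random flip/rotation and apply it to BOTH latents,
        # so the regression pair stays aligned after augmentation.
        if torch.rand(()) < 0.5:
            a, b = torch.flip(a, dims=[-1]), torch.flip(b, dims=[-1])
        k = int(torch.randint(0, 4, ()))  # 0..3 quarter turns
        return torch.rot90(a, k, dims=[-2, -1]), torch.rot90(b, k, dims=[-2, -1])

    def e_latent_lpips(fake_latent, target_latent, feature_distance, n_aug=4):
        # Average the latent-space perceptual distance over an ensemble of
        # shared augmentations; decoding to pixels is never required.
        total = fake_latent.new_zeros(())
        for _ in range(n_aug):
            f, t = paired_augment(fake_latent, target_latent)
            total = total + feature_distance(f, t)
        return total / n_aug

In a training loop, fake_latent would be the one-step generator's output for a stored noise-prompt pair and target_latent the teacher's ODE-trajectory endpoint; the multi-scale adversarial and text-alignment terms described in the abstract would be added on top of this regression loss.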


Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXVIII
Springer-Verlag, Berlin, Heidelberg, September 2024
570 pages
ISBN: 978-3-031-73389-5
DOI: 10.1007/978-3-031-73390-1
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
