Article

GAN-Diffusion Relay Model: Advancing Semantic Image Synthesis

Authors:

Wei Li

Pattern Recognition and Computer Vision: 7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part IV

Pages 393 - 406

https://doi.org/10.1007/978-981-97-8505-6_28

Published: 07 November 2024

Abstract

Semantic image synthesis, which transforms semantic layouts into realistic images, aims to comprehend and leverage the given semantic information. Despite impressive recent advances, challenges persist in fidelity, semantic alignment, and training stability. To improve generation quality and semantic alignment, we re-engineer the noise mapping and semantic space embedding and propose a novel semantic image synthesis model, the GAN-Diffusion Relay Model (GDRM), based on a GAN and a relay diffusion model. Extensive experiments on benchmark datasets validate the effectiveness of the proposed approach, which achieves state-of-the-art performance in terms of fidelity (FID) and diversity (LPIPS).

Index Terms

GAN-Diffusion Relay Model: Advancing Semantic Image Synthesis
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
      2. Image and video acquisition
        Computational photography
  2. Machine learning
    1. Learning paradigms
      1. Multi-task learning
        Transfer learning
    2. Machine learning approaches
      1. Learning latent representations
2. Information systems
  1. Information systems applications
    1. Multimedia information systems
      1. Multimedia content creation

Index terms have been assigned to the content through auto-classification.

Published In

Pattern Recognition and Computer Vision: 7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part IV

Oct 2024

523 pages

ISBN: 978-981-97-8504-9

DOI: 10.1007/978-981-97-8505-6

Editors:
Zhouchen Lin, Peking University, Beijing, China
Ming-Ming Cheng, Nankai University, Tianjin, China
Ran He, Chinese Academy of Sciences, Beijing, China
Kurban Ubul, Xinjiang University, Ürümqi, Xinjiang, China
Wushouer Silamu, Xinjiang University, Ürümqi, China
Hongbin Zha, Peking University, Beijing, China (https://ror.org/02v51f717)
Jie Zhou, Tsinghua University, Beijing, China
Cheng-Lin Liu, Chinese Academy of Sciences, Beijing, China

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

Publisher

Springer-Verlag

Berlin, Heidelberg
