
Multi-Model Style-Aware Diffusion Learning for Semantic Image Synthesis

Published: 13 November 2024

Abstract

Semantic image synthesis aims to generate images from given semantic layouts, a challenging task that requires models to capture the relationship between layouts and images. Previous works are usually based on Generative Adversarial Networks (GANs) or autoregressive (AR) models. However, GAN training is unstable, and AR performance suffers from the independent image encoder and the unidirectional generation bias. Because of these limitations, such methods tend to synthesize unrealistic, poorly aligned images and consider only single-style image generation. In this paper, we propose a Multi-model Style-aware Diffusion Learning (MSDL) framework for semantic image synthesis, consisting of a training module and a sampling module. In the training module, a layout-to-image model transfers knowledge from a model pretrained on massive, weakly correlated text-image pairs, making training more efficient. In the sampling module, we design a map-guidance technique and a multi-model style-guidance strategy for creating images in multiple styles, e.g., oil painting, Disney cartoon, and pixel art. We evaluate our method on Cityscapes, ADE20K, and COCO-Stuff, making visual comparisons and computing multiple metrics such as FID and LPIPS. Experimental results demonstrate that our model is highly competitive, especially in terms of fidelity and diversity.
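The multi-model guidance in the sampling module can be illustrated with a small sketch. The snippet below is a simplified, hypothetical illustration, not the paper's actual implementation: in the spirit of classifier-free guidance, noise predictions from a layout-conditioned model and a style-conditioned model are combined as weighted deltas against an unconditional prediction. The names `guided_noise_estimate`, `w_layout`, and `w_style` are illustrative assumptions.

```python
def guided_noise_estimate(eps_uncond, eps_layout, eps_style,
                          w_layout=2.0, w_style=1.5):
    """Combine unconditional, layout-conditioned, and style-conditioned
    noise predictions (given here as plain lists of floats standing in
    for tensors).  Each conditional prediction contributes a weighted
    delta relative to the unconditional one, a common way to extend
    classifier-free guidance to two conditioning models."""
    return [u + w_layout * (l - u) + w_style * (s - u)
            for u, l, s in zip(eps_uncond, eps_layout, eps_style)]

# Sanity checks: zero weights recover the unconditional prediction;
# w_layout=1, w_style=0 recovers the layout-conditioned prediction.
assert guided_noise_estimate([0, 0], [1, 1], [2, 2], 0.0, 0.0) == [0.0, 0.0]
assert guided_noise_estimate([0, 0], [1, 1], [2, 2], 1.0, 0.0) == [1.0, 1.0]
```

At sampling time, such a combined estimate would replace the single-model noise prediction at each denoising step; raising `w_style` would trade layout fidelity for stronger stylization.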

Supplemental Material

PDF File - Additional Samples Generated from Our Models
The supplemental file contains more samples generated from our models: Figures 1 and 2 contain additional images from our model trained on ADE20K and COCO-Stuff. Figure 3 shows comparisons on Cityscapes with more methods, including SPADE, CC-FPSE, and SCGAN. Figure 4 shows more novel images in different styles produced with multi-model style-guidance. Figure 5 shows the influence of using random text in the text-to-image model for style guidance. Figure 6 shows the impact of using mismatched textual conditions in the map-to-image model.



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 11
November 2024
702 pages
EISSN:1551-6865
DOI:10.1145/3613730

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2024
Online AM: 02 August 2024
Accepted: 16 July 2024
Revised: 02 July 2024
Received: 27 April 2023
Published in TOMM Volume 20, Issue 11


Author Tags

  1. Semantic image synthesis
  2. diffusion model
  3. pretrained model

Qualifiers

  • Research-article

Funding Sources

  • National Key R & D Program of China
  • Beijing Municipal Science and Technology
  • National Natural Science Foundation of China
  • Beijing Natural Science Foundation

