
Multi-Model Style-Aware Diffusion Learning for Semantic Image Synthesis

Published: 13 November 2024

Abstract

Semantic image synthesis aims to generate images from given semantic layouts, a challenging task that requires models to capture the relationship between layouts and images. Previous works are usually based on Generative Adversarial Networks (GANs) or autoregressive (AR) models. However, GAN training is unstable, and AR performance suffers from the independent image encoder and the unidirectional generation bias. Because of these limitations, such methods tend to synthesize unrealistic, poorly aligned images and consider only single-style image generation. In this paper, we propose a Multi-model Style-aware Diffusion Learning (MSDL) framework for semantic image synthesis, consisting of a training module and a sampling module. In the training module, a layout-to-image model transfers knowledge from a model pretrained on massive, weakly correlated text-image pairs, making training more efficient. In the sampling module, we design a map-guidance technique and a multi-model style-guidance strategy for creating images in multiple styles, e.g., oil painting, Disney cartoon, and pixel art. We evaluate our method on Cityscapes, ADE20K, and COCO-Stuff, making visual comparisons and computing multiple metrics such as FID and LPIPS. Experimental results demonstrate that our model is highly competitive, especially in terms of fidelity and diversity.
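The multi-model guidance in the sampling module can be illustrated with a small sketch. The snippet below is a simplified, hypothetical illustration, not the paper's actual implementation: in the spirit of classifier-free guidance, noise predictions from a layout-conditioned model and a style-conditioned model are combined as weighted deltas against an unconditional prediction. The names `guided_noise_estimate`, `w_layout`, and `w_style` are illustrative assumptions.

```python
def guided_noise_estimate(eps_uncond, eps_layout, eps_style,
                          w_layout=2.0, w_style=1.5):
    """Combine unconditional, layout-conditioned, and style-conditioned
    noise predictions (given here as plain lists of floats standing in
    for tensors).  Each conditional prediction contributes a weighted
    delta relative to the unconditional one, a common way to extend
    classifier-free guidance to two conditioning models."""
    return [u + w_layout * (l - u) + w_style * (s - u)
            for u, l, s in zip(eps_uncond, eps_layout, eps_style)]

# Sanity checks: zero weights recover the unconditional prediction;
# w_layout=1, w_style=0 recovers the layout-conditioned prediction.
assert guided_noise_estimate([0, 0], [1, 1], [2, 2], 0.0, 0.0) == [0.0, 0.0]
assert guided_noise_estimate([0, 0], [1, 1], [2, 2], 1.0, 0.0) == [1.0, 1.0]
```

At sampling time, such a combined estimate would replace the single-model noise prediction at each denoising step; raising `w_style` would trade layout fidelity for stronger stylization.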

Supplemental Material

PDF File - Additional Samples Generated from Our Models
The supplemental file contains more samples generated from our models: Figures 1 and 2 contain additional images from our model trained on ADE20K and COCO-Stuff. Figure 3 shows comparisons on Cityscapes with more methods, including SPADE, CC-FPSE, and SCGAN. Figure 4 shows more novel images in different styles produced with multi-model style-guidance. Figure 5 shows the influence of using random text in the text-to-image model for style guidance. Figure 6 shows the impact of using mismatched textual conditions in the map-to-image model.



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 11
November 2024
702 pages
EISSN:1551-6865
DOI:10.1145/3613730

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2024
Online AM: 02 August 2024
Accepted: 16 July 2024
Revised: 02 July 2024
Received: 27 April 2023
Published in TOMM Volume 20, Issue 11


Author Tags

  1. Semantic image synthesis
  2. diffusion model
  3. pretrained model

Qualifiers

  • Research-article

Funding Sources

  • National Key R & D Program of China
  • Beijing Municipal Science and Technology
  • National Natural Science Foundation of China
  • Beijing Natural Science Foundation

