Abstract
Semantic image synthesis (SIS) shows great promise for sensor simulation. However, current best practices in this field, which are based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we evaluate ControlNet, a method notable for its dense control capabilities. Our investigation uncovers two primary issues with its results: spurious sub-structures within large semantic areas, and misalignment of the generated content with the semantic mask. Through empirical study, we trace both problems to a mismatch between the noised training data distribution and the standard normal prior applied at inference. To address this, we develop noise priors tailored to SIS for use at inference: a spatial prior, a categorical prior, and a novel spatial-categorical joint prior. This approach, which we name SCP-Diff, sets new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding an FID as low as 10.53 on Cityscapes. The code and models can be accessed via the project page.
H. Gao and M. Gao: equal contribution.
Project Page: https://air-discover.github.io/SCP-Diff/.
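The mismatch named in the abstract can be illustrated with a small sketch. Under the standard forward diffusion process, a latent noised to the final step is x_T = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, so whenever a_bar is not exactly zero the noised training data keeps some class-dependent signal that a standard normal prior ignores. The code below is not the authors' implementation: it assumes that a categorical prior can be read as per-class mean and standard deviation of noised latents, and the function names, the toy latents, and the value `a_bar = 0.25` (exaggerated so the effect is visible) are all hypothetical.

```python
import numpy as np


def noise_latents(latents, a_bar, rng):
    """Forward diffusion to the last step: x_T = sqrt(a_bar) x_0 + sqrt(1 - a_bar) eps."""
    eps = rng.standard_normal(latents.shape)
    return np.sqrt(a_bar) * latents + np.sqrt(1.0 - a_bar) * eps


def categorical_prior(noised, masks, num_classes):
    """Per-class, per-channel mean and std of noised latents.

    noised: (N, C, H, W) float array, masks: (N, H, W) integer class ids.
    Classes with fewer than two samples fall back to N(0, 1).
    """
    channels = noised.shape[1]
    mean = np.zeros((num_classes, channels))
    std = np.ones((num_classes, channels))
    flat = noised.transpose(0, 2, 3, 1).reshape(-1, channels)
    labels = masks.reshape(-1)
    for k in range(num_classes):
        selected = flat[labels == k]
        if len(selected) > 1:
            mean[k] = selected.mean(axis=0)
            std[k] = selected.std(axis=0)
    return mean, std


def sample_initial_noise(mask, mean, std, rng):
    """Draw x_T for one (H, W) mask; each location uses its class statistics."""
    mu = mean[mask].transpose(2, 0, 1)      # (C, H, W)
    sigma = std[mask].transpose(2, 0, 1)    # (C, H, W)
    return mu + sigma * rng.standard_normal(mu.shape)


# Toy data (assumption): two classes whose clean latents sit near +2 and -2.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(8, 16, 16))
latents = np.where(masks[:, None] == 0, 2.0, -2.0) \
    + 0.1 * rng.standard_normal((8, 4, 16, 16))

a_bar = 0.25  # hypothetical value, far larger than in practical schedules
noised = noise_latents(latents, a_bar, rng)
mean, std = categorical_prior(noised, masks, num_classes=2)
x_T = sample_initial_noise(masks[0], mean, std, rng)

print("per-class means of noised latents:\n", mean)
print("initial noise shape:", x_T.shape)
```

In this toy setting the per-class means of the noised latents stay clearly away from zero, which is the kind of gap between training-time and inference-time starting points that the abstract describes. A spatial prior could be sketched the same way by computing the statistics per location instead of per class.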
Notes
- 1. The real images in Fig. 1 are all on the left.
- 2. This research is supported by Tsinghua University - Mercedes Benz Institute for Sustainable Mobility.
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gao, Ha. et al. (2025). SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15090. Springer, Cham. https://doi.org/10.1007/978-3-031-73411-3_3
DOI: https://doi.org/10.1007/978-3-031-73411-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73410-6
Online ISBN: 978-3-031-73411-3
eBook Packages: Computer Science, Computer Science (R0)