
USIS: Unsupervised Semantic Image Synthesis

Published: 01 April 2023

Abstract

Semantic Image Synthesis (SIS) is a subclass of image-to-image (I2I) translation in which a photorealistic image is synthesized from a segmentation mask. SIS has mainly been addressed as a supervised problem. However, state-of-the-art methods depend on a massive amount of labeled data and cannot be applied in an unpaired setting. On the other hand, generic unpaired I2I frameworks underperform in comparison. In this work, we propose a new framework, Unsupervised Semantic Image Synthesis (USIS), as a first step towards closing the performance gap between the paired and unpaired settings. We design a simple and effective learning scheme that combines the fragmented benefits of cycle losses and relationship-preservation constraints. We then discover that, contrary to generic I2I translation, discriminator design is crucial for label-to-image translation. To this end, we design a new discriminator with a wavelet-based encoder and a decoder that reconstructs the real images. The self-supervised reconstruction loss in the decoder prevents the encoder from overfitting to a few wavelet coefficients. We evaluate our methodology on three challenging datasets and set a new standard for unpaired SIS. The generated images demonstrate significantly better diversity, quality, and multimodality.
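
To make the discriminator idea concrete, the following is a minimal PyTorch sketch of the general pattern the abstract describes: an encoder that judges images from their Haar wavelet sub-bands, an adversarial head, and a decoder head whose self-supervised reconstruction loss on real images keeps the encoder from collapsing onto a few coefficients. All layer counts, widths, and the choice of an L1 reconstruction term are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a wavelet-based discriminator with a reconstruction
# decoder, in the spirit of the USIS abstract (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    """One level of the 2D Haar wavelet transform.

    Splits (B, C, H, W) into the four sub-bands LL, LH, HL, HH,
    each (B, C, H/2, W/2), concatenated along channels.
    """
    a = x[:, :, 0::2, 0::2]  # even rows, even cols
    b = x[:, :, 0::2, 1::2]  # even rows, odd cols
    c = x[:, :, 1::2, 0::2]  # odd rows, even cols
    d = x[:, :, 1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

class WaveletDiscriminator(nn.Module):
    """Encoder scores real/fake in the wavelet domain; a decoder head
    reconstructs real images as a self-supervised regularizer."""
    def __init__(self, in_ch=3, width=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4 * in_ch, width, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(width, 2 * width, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )
        # Patch-level real/fake logits.
        self.adv_head = nn.Conv2d(2 * width, 1, 3, padding=1)
        # Reconstruction head (applied to real images only).
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(2 * width, width, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv2d(width, in_ch, 3, padding=1),
        )

    def forward(self, img):
        feats = self.encoder(haar_dwt(img))
        return self.adv_head(feats), self.decoder(feats)

# Discriminator loss on a real batch: adversarial term plus the
# self-supervised reconstruction term regularizing the encoder.
disc = WaveletDiscriminator()
real = torch.randn(2, 3, 64, 64)
logits, recon = disc(real)
d_loss = F.binary_cross_entropy_with_logits(
    logits, torch.ones_like(logits)) + F.l1_loss(recon, real)
```

Operating on the wavelet decomposition rather than raw pixels exposes high-frequency detail directly to the discriminator, which is the motivation the abstract gives for the design.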


Highlights

We propose an unpaired model for semantic image synthesis.
An unsupervised scheme with a one-sided cycle loss is proposed to preserve alignment (see the sketch after this list).
We exploit a whole-image wavelet-based discriminator for class appearance matching.
Moreover, we design a decoder on top of the discriminator to regularize the training.
A new standard for unpaired SIS is achieved on ADE20K, Cityscapes and COCO-stuff.
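
As referenced in the second highlight, below is a hedged sketch of what a one-sided cycle loss looks like: only the label-to-image-to-label direction is cycled, by segmenting the generated image and matching the prediction against the input mask, so no image-to-label generator is needed. The `generator` and `segmenter` modules and the class count are hypothetical stand-ins, not the paper's networks.

```python
# Hedged sketch of a one-sided cycle loss for unpaired label-to-image
# translation (assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 35  # illustrative label count only

generator = nn.Conv2d(NUM_CLASSES, 3, 3, padding=1)  # stand-in SIS generator
segmenter = nn.Conv2d(3, NUM_CLASSES, 3, padding=1)  # stand-in segmenter

def one_sided_cycle_loss(mask_onehot):
    """mask_onehot: (B, NUM_CLASSES, H, W) one-hot segmentation mask."""
    fake_image = generator(mask_onehot)      # label -> image
    pred_logits = segmenter(fake_image)      # image -> label (cycle back)
    target = mask_onehot.argmax(dim=1)       # (B, H, W) class indices
    return F.cross_entropy(pred_logits, target)

mask = F.one_hot(torch.randint(0, NUM_CLASSES, (2, 64, 64)),
                 NUM_CLASSES).permute(0, 3, 1, 2).float()
loss = one_sided_cycle_loss(mask)  # gradients reach the generator via fake_image
```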




Published In

Computers and Graphics, Volume 111, Issue C
Apr 2023
230 pages

Publisher

Pergamon Press, Inc.

United States


Author Tags

  1. Generative adversarial networks
  2. Semantic image synthesis
  3. Unpaired I2I translation

Qualifiers

  • Research-article
