DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

Published: 10 October 2022

Abstract

Text-to-image generation aims at generating realistic images that are semantically consistent with a given text. Previous works mainly adopt a multi-stage architecture that stacks generator-discriminator pairs to engage in multiple rounds of adversarial training, where the text semantics used to guide generation remain static across all stages. This work argues that the text features at each stage should be adaptively re-composed conditioned on the status of the historical stage (i.e., the historical stage's text and image features) to provide diversified and accurate semantic guidance during the coarse-to-fine generation process. We therefore propose a novel Dynamic Semantic Evolution GAN (DSE-GAN), which re-composes each stage's text features under a novel single-adversarial multi-stage architecture. Specifically, we design (1) the Dynamic Semantic Evolution (DSE) module, which first aggregates historical image features to summarize the generative feedback, then dynamically selects the words to be re-composed at each stage and re-composes them by dynamically enhancing or suppressing the semantics of different-granularity subspaces; and (2) the Single Adversarial Multi-stage Architecture (SAMA), which extends the previous structure by eliminating the complicated requirement of multiple rounds of adversarial training, thereby allowing more stages of text-image interaction and, in turn, facilitating the DSE module. Comprehensive experiments show that DSE-GAN achieves 7.48% and 37.8% relative FID improvements on two widely used benchmarks, CUB-200 and MSCOCO, respectively.
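The DSE idea described above (summarize generative feedback from historical image features, then gate each word's re-composition) can be sketched in a few lines. This is an illustrative sketch only, not the paper's actual architecture: the function name `dse_recompose`, the mean-pooling feedback aggregator, and the sigmoid word-selection gate are all assumptions made for exposition.

```python
import numpy as np

def dse_recompose(word_feats, img_feats):
    """Hedged sketch of a DSE-style text-feature update.

    word_feats: (T, d) word embeddings from the previous stage.
    img_feats:  (n_pix, d) historical image feature grid.
    """
    d = word_feats.shape[1]
    # Summarize the generative feedback: mean-pool the image feature grid.
    summary = img_feats.mean(axis=0)                      # (d,)
    # Score each word against the image summary (scaled dot product).
    scores = word_feats @ summary / np.sqrt(d)            # (T,)
    gates = 1.0 / (1.0 + np.exp(-scores))                 # sigmoid, in (0, 1)
    # Gate near 1: re-compose the word with image context;
    # gate near 0: keep the original word largely unchanged.
    enhanced = word_feats + summary                       # broadcast over words
    return gates[:, None] * enhanced + (1.0 - gates[:, None]) * word_feats

# Toy usage: 5 words of dimension 8, a 4x4 image grid flattened to 16 pixels.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
img = rng.normal(size=(16, 8))
out = dse_recompose(words, img)
print(out.shape)  # (5, 8)
```

In a multi-stage generator, a sketch like this would run once per stage, so the word features evolve as the image is refined instead of staying static across all stages.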

Supplementary Material

MP4 File (MM22-fp0545.mp4)
We propose DSE-GAN, a novel sequential generation framework over both text and images for text-to-image (T2I) generation, which dynamically re-composes text features based on the historical stage. To the best of our knowledge, this is the first T2I framework that adaptively re-composes text features at each stage. We propose the Dynamic Semantic Evolution (DSE) module, which dynamically re-composes text features at different stages, providing diversified and accurate coarse-to-fine semantic guidance while suppressing repeated rendering.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. dynamic network
    2. generative adversarial network
    3. text-to-image

    Qualifiers

    • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%



Cited By

    • (2024) TR-TransGAN: Temporal Recurrent Transformer Generative Adversarial Network for Longitudinal MRI Dataset Expansion. IEEE Transactions on Cognitive and Developmental Systems 16(4), 1223-1232. DOI: 10.1109/TCDS.2023.3345922
    • (2024) Enhancing fine-detail image synthesis from text descriptions by text aggregation and connection fusion module. Image Communication 122(C). DOI: 10.1016/j.image.2023.117099
    • (2023) Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22596-22605. DOI: 10.1109/CVPR52729.2023.02164
    • (2023) Learning Semantic Relationship among Instances for Image-Text Matching. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15159-15168. DOI: 10.1109/CVPR52729.2023.01455
    • (2023) Fine-grained Audible Video Description. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10585-10596. DOI: 10.1109/CVPR52729.2023.01020
    • (2023) Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2002-2011. DOI: 10.1109/CVPR52729.2023.00199
    • (2023) High-Definition Image Formation Using Multi-stage Cycle Generative Adversarial Network with Applications in Image Forensic. Arabian Journal for Science and Engineering 49(3), 3887-3896. DOI: 10.1007/s13369-023-08193-x
    • (2023) GH-DDM: the generalized hybrid denoising diffusion model for medical image generation. Multimedia Systems 29(3), 1335-1345. DOI: 10.1007/s00530-023-01059-0
