Abstract
Generating images from natural language descriptions is a challenging task. Prior research has mainly focused on enhancing generation quality through spatial and/or textual attention, thereby neglecting the relationships between channels. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images from textual descriptions. CAGAN utilises two attention models: word attention, which draws different sub-regions conditioned on related words, and squeeze-and-excitation attention, which captures non-linear interactions among channels. With spectral normalisation to stabilise training, our proposed CAGAN achieves state-of-the-art FID and competitive IS scores on the CUB dataset and on the more challenging COCO dataset. Furthermore, we demonstrate that judging a model by a single evaluation metric can be misleading: an additional variant with local self-attention scores a higher IS than our main model, yet generates unrealistic images through feature repetition.
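The channel attention the abstract refers to is the squeeze-and-excitation mechanism of Hu et al. (CVPR 2018). The following is a minimal sketch of how such channel gating works, assuming a PyTorch implementation; the channel count, reduction ratio, and placement inside the generator are illustrative assumptions, not the authors' configuration. The block squeezes each feature map to a per-channel statistic, passes it through a small bottleneck MLP, and rescales the channels with the resulting gates; spectral normalisation, used to stabilise training, is shown on a single hypothetical discriminator layer.

import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Generic squeeze-and-excitation channel attention (Hu et al., CVPR 2018)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global average per channel
        self.fc = nn.Sequential(              # excitation: learn per-channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * gates                       # rescale each channel of the feature map

# Hypothetical usage on an intermediate 64-channel generator feature map.
feats = torch.randn(4, 64, 32, 32)
print(SqueezeExcitation(64)(feats).shape)      # torch.Size([4, 64, 32, 32])

# Spectral normalisation applied to a single (hypothetical) discriminator layer.
sn_conv = nn.utils.spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1))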
About this paper
Cite this paper
Schulze, H., Yaman, D., Waibel, A. (2021). CAGAN: Text-To-Image Generation with Combined Attention Generative Adversarial Networks. In: Bauckhage, C., Gall, J., Schwing, A. (eds.) Pattern Recognition. DAGM GCPR 2021. Lecture Notes in Computer Science, vol. 13024. Springer, Cham. https://doi.org/10.1007/978-3-030-92659-5_25
Print ISBN: 978-3-030-92658-8
Online ISBN: 978-3-030-92659-5