
CAGAN: Text-To-Image Generation with Combined Attention Generative Adversarial Networks

  • Conference paper
  • In: Pattern Recognition (DAGM GCPR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13024)

Abstract

Generating images from natural language descriptions is a challenging task. Prior research has mainly focused on enhancing generation quality by investigating spatial and/or textual attention, thereby neglecting the relationships between channels. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images from textual descriptions. The proposed CAGAN utilises two attention models: word attention, which draws different sub-regions conditioned on related words, and squeeze-and-excitation attention, which captures non-linear interactions among channels. With spectral normalisation to stabilise training, our proposed CAGAN achieves state-of-the-art FID and competitive IS scores on the CUB dataset and on the more challenging COCO dataset. Furthermore, we demonstrate that judging a model by a single evaluation metric can be misleading: an additional variant with local self-attention scores a higher IS than our main model, yet generates unrealistic images through feature repetition.
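
The paper's full architecture is not reproduced on this page, but the two generic building blocks the abstract names, squeeze-and-excitation channel attention and spectral normalisation, are standard published components. The sketch below is a minimal PyTorch illustration of both, not the authors' code: the module name SEAttention, the reduction ratio of 16, and the discriminator convolution shape are assumptions made for the example.

    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    class SEAttention(nn.Module):
        """Squeeze-and-excitation channel attention (Hu et al., CVPR 2018).

        Global average pooling squeezes each feature map into a channel
        descriptor; a two-layer bottleneck MLP with a sigmoid gate then
        models non-linear channel interactions and rescales each channel.
        """

        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)  # (B, C, H, W) -> (B, C, 1, 1)
            self.excite = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            w = self.excite(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w  # channel-wise rescaling

    # Spectral normalisation (Miyato et al., ICLR 2018) bounds a layer's
    # spectral norm to stabilise GAN training; PyTorch ships it as a wrapper.
    disc_conv = spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1))

    if __name__ == "__main__":
        feats = torch.randn(2, 64, 32, 32)
        print(SEAttention(64)(feats).shape)  # torch.Size([2, 64, 32, 32])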

Notes

  1. https://github.com/bioinf-jku/TTUR.
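
The note above links to the reference FID implementation (TTUR). As a rough sketch of what the metric computes, not the authors' evaluation code, the following NumPy/SciPy function evaluates the standard Fréchet distance between two Gaussians fitted to Inception-v3 activation statistics of real and generated images (Heusel et al., 2017).

    import numpy as np
    from scipy import linalg

    def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
        """FID between Gaussians fitted to Inception activations:
        ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2)).
        """
        diff = mu1 - mu2
        covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix sqrt
        if not np.isfinite(covmean).all():
            # Regularise numerically singular covariance products.
            offset = np.eye(sigma1.shape[0]) * eps
            covmean, _ = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset), disp=False)
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm
        return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))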

Author information

Correspondence to Henning Schulze.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Schulze, H., Yaman, D., Waibel, A. (2021). CAGAN: Text-To-Image Generation with Combined Attention Generative Adversarial Networks. In: Bauckhage, C., Gall, J., Schwing, A. (eds) Pattern Recognition. DAGM GCPR 2021. Lecture Notes in Computer Science, vol 13024. Springer, Cham. https://doi.org/10.1007/978-3-030-92659-5_25

  • DOI: https://doi.org/10.1007/978-3-030-92659-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92658-8

  • Online ISBN: 978-3-030-92659-5

  • eBook Packages: Computer Science (R0)
