CA-GAN: Conditional Adaptive Generative Adversarial Network for Text-to-Image Synthesis

Published: 29 January 2024
    Abstract

    Text-to-image synthesis has become a popular multimodal task in recent years, and it faces two major challenges: maintaining semantic consistency and avoiding fine-grained information loss. Existing methods mostly adopt either a multi-stage stacked architecture or a single-stream model with several affine transformations as the fusion block. The former requires additional networks to ensure semantic consistency between text and image, which is complex and results in poor generation quality. The latter simply derives its affine transformations from Conditional Batch Normalization (CBN), which cannot match text features well. To address these issues, we propose an effective Conditional Adaptive Generative Adversarial Network (CA-GAN). CA-GAN adopts a single-stream architecture consisting of a single generator/discriminator pair. Specifically, we propose (1) a conditional adaptive instance normalization residual block that helps the generator synthesize high-quality images carrying semantic information, and (2) an attention block that focuses on image-related channels and pixels. Extensive experiments on the CUB and COCO datasets show the superiority of CA-GAN over previous text-to-image synthesis methods.
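
    The abstract names two building blocks but gives no implementation details, so the PyTorch sketch below is only an illustration under assumptions: the conditional adaptive instance normalization is read as AdaIN (Huang and Belongie) whose per-channel scale and shift are predicted from a sentence embedding, and the attention block is approximated by a squeeze-and-excitation channel gate followed by a 1x1-convolution pixel gate. Every class name and dimension here (ConditionalAdaIN, CAResBlock, ChannelPixelAttention, the 256-d text vector) is hypothetical, not taken from the paper.

```python
# Hypothetical sketch of the two blocks named in the abstract; the actual
# CA-GAN layer sizes, names, and wiring may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAdaIN(nn.Module):
    """Instance-normalize image features, then re-scale/shift them with
    affine parameters predicted from the sentence embedding (the
    'conditional adaptive' part)."""
    def __init__(self, num_channels: int, text_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # One linear layer predicts per-channel gamma and beta from the text.
        self.affine = nn.Linear(text_dim, num_channels * 2)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.affine(text).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

class CAResBlock(nn.Module):
    """Residual block: conditional AdaIN -> ReLU -> conv, twice, plus skip."""
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.adain1 = ConditionalAdaIN(channels, text_dim)
        self.adain2 = ConditionalAdaIN(channels, text_dim)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, text):
        h = self.conv1(F.relu(self.adain1(x, text)))
        h = self.conv2(F.relu(self.adain2(h, text)))
        return x + h

class ChannelPixelAttention(nn.Module):
    """Squeeze-and-excitation style channel gate followed by a 1x1-conv
    pixel gate, approximating 'attention over image-related channels and
    pixels'."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.pixel_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_gate(x)   # weight informative channels
        return x * self.pixel_gate(x)  # weight informative spatial positions

# Usage on dummy data: 256-channel 16x16 feature map, 256-d sentence vector.
feat = torch.randn(4, 256, 16, 16)
sent = torch.randn(4, 256)
out = ChannelPixelAttention(256)(CAResBlock(256, 256)(feat, sent))
print(out.shape)  # torch.Size([4, 256, 16, 16])
```

    Stacking several such residual blocks with progressive upsampling would give a single-stream generator of the kind the abstract describes; this is a reading of the abstract, not the paper's confirmed design.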



    Published In

    MultiMedia Modeling: 30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 – February 2, 2024, Proceedings, Part III
    Jan 2024
    551 pages
    ISBN: 978-3-031-53310-5
    DOI: 10.1007/978-3-031-53311-2
    Editors: Stevan Rudinac, Alan Hanjalic, Cynthia Liem, Marcel Worring, Björn Þór Jónsson, Bei Liu, Yoko Yamakata

    Publisher

    Springer-Verlag, Berlin, Heidelberg


    Author Tags

    1. Multi-modal
    2. Text-to-Image
    3. GAN

    Qualifiers

    • Article
