
End-to-End Text-to-Image Synthesis with Spatial Constraints

Published: 25 May 2020

Abstract

Although the performance of automatically generating high-resolution, realistic images from text descriptions has improved significantly, many challenging issues in image synthesis, such as shape variations, viewpoint changes, pose changes, and the relations among multiple objects, have not been fully investigated. In this article, we propose a novel end-to-end approach for text-to-image synthesis with spatial constraints that mines object location and shape information. Instead of learning a hierarchical mapping from text to image, our algorithm directly generates fine-grained multi-object images under the guidance of generated semantic layouts. By fusing text semantics and spatial information into a synthesis module and jointly fine-tuning them with the generated multi-scale semantic layouts, the proposed networks achieve impressive performance in text-to-image synthesis for complex scenes. We evaluate our method on both the single-object CUB dataset and the multi-object MS-COCO dataset. Comprehensive experimental results demonstrate that our method consistently outperforms state-of-the-art approaches across different evaluation metrics.
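To make the layout-guided idea concrete, below is a minimal PyTorch-style sketch of a generator that first predicts a coarse semantic layout from a text embedding and a noise vector, then fuses that layout with broadcast text features to render the image, so both stages can be trained jointly end to end. This is an illustration of the general technique only, not the authors' implementation; all module names, tensor shapes, grid sizes, and the concatenation-based fusion are assumptions made for exposition.

import torch
import torch.nn as nn

class LayoutGuidedGenerator(nn.Module):
    """Toy end-to-end generator (illustrative only, not the paper's code):
    predict a coarse semantic layout from the text embedding, then fuse the
    layout with the text features to synthesize the image."""

    def __init__(self, text_dim=256, z_dim=100, n_classes=80, img_ch=3):
        super().__init__()
        # Stage 1: text + noise -> per-pixel class scores on a 16x16 grid.
        self.layout_head = nn.Sequential(
            nn.Linear(text_dim + z_dim, 128 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (128, 16, 16)),
            nn.Conv2d(128, n_classes, 3, padding=1),
        )
        # Stage 2: fuse layout probabilities with broadcast text features
        # and upsample twice to a 64x64 image.
        self.decoder = nn.Sequential(
            nn.Conv2d(n_classes + text_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, img_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, text_emb, z):
        layout = self.layout_head(torch.cat([text_emb, z], dim=1))  # (B, C, 16, 16)
        soft = layout.softmax(dim=1)                                # soft class map
        txt = text_emb[:, :, None, None].expand(-1, -1, 16, 16)     # broadcast text
        img = self.decoder(torch.cat([soft, txt], dim=1))           # (B, 3, 64, 64)
        return img, layout   # the layout can receive its own supervision

g = LayoutGuidedGenerator()
img, layout = g(torch.randn(4, 256), torch.randn(4, 100))
print(img.shape, layout.shape)  # torch.Size([4, 3, 64, 64]) torch.Size([4, 80, 16, 16])

The paper's actual pipeline generates layouts at multiple scales and fine-tunes them jointly with the synthesis module; the sketch collapses this to a single 16x16 layout and a 64x64 output for brevity.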





      Published In

      ACM Transactions on Intelligent Systems and Technology  Volume 11, Issue 4
      Survey Paper and Regular Paper
      August 2020
      358 pages
      ISSN:2157-6904
      EISSN:2157-6912
      DOI:10.1145/3401889
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 May 2020
      Online AM: 07 May 2020
      Accepted: 01 March 2020
      Revised: 01 March 2020
      Received: 01 July 2019
      Published in TIST Volume 11, Issue 4


      Author Tags

      1. CUB
      2. Computer vision
      3. MS-COCO
      4. spatial constraints
      5. text-to-image synthesis

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Beijing Natural Science Foundation
      • National Natural Science Foundation of China


      Article Metrics

      • Downloads (last 12 months): 46
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 12 Sep 2024

      Cited By
      • (2023) A survey of generative adversarial networks and their application in text-to-image synthesis. Electronic Research Archive 31, 12, 7142-7181. DOI: 10.3934/era.2023362
      • (2023) TAM GAN: Tamil Text to Naturalistic Image Synthesis Using Conventional Deep Adversarial Networks. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 5, 1-18. DOI: 10.1145/3584019
      • (2023) Multimodal Image Synthesis and Editing: The Generative AI Era. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 12, 15098-15119. DOI: 10.1109/TPAMI.2023.3305243
      • (2023) Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks. IEEE Transactions on Multimedia 25, 7062-7075. DOI: 10.1109/TMM.2022.3217384
      • (2023) Text Guided Image Inpainting Based on Generative Adversarial Network. In 2023 8th International Conference on Computational Intelligence and Applications (ICCIA), 128-132. DOI: 10.1109/ICCIA59741.2023.00031
      • (2023) Vision + Language Applications: A Survey. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 826-842. DOI: 10.1109/CVPRW59228.2023.00090
      • (2023) Recent Advances in Text-to-Image Synthesis: Approaches, Datasets and Future Research Prospects. IEEE Access 11, 88099-88115. DOI: 10.1109/ACCESS.2023.3306422
      • (2023) Enhanced Text-to-Image Synthesis With Self-Supervision. IEEE Access 11, 39508-39519. DOI: 10.1109/ACCESS.2023.3268869
      • (2022) A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint. Sensors 22, 18, 6816. DOI: 10.3390/s22186816
      • (2022) aRTIC GAN: A Recursive Text-Image-Conditioned GAN. Electronics 11, 11, 1737. DOI: 10.3390/electronics11111737
