Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Yu, Jiahui; Xu, Yuanzhong; Koh, Jing Yu; Luong, Thang; Baid, Gunjan; Wang, Zirui; Vasudevan, Vijay; Ku, Alexander; Yang, Yinfei; Ayan, Burcu Karagol; Hutchinson, Ben; Han, Wei; Parekh, Zarana; Li, Xin; Zhang, Han; Baldridge, Jason; Wu, Yonghui


arXiv:2206.10789 (cs)

[Submitted on 22 Jun 2022]


Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See this https URL for high-resolution images.
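The two steps named in the abstract (encode images as discrete tokens, then predict those tokens from text with a sequence-to-sequence model) can be illustrated with a toy sketch. This is not the Parti or ViT-VQGAN implementation: the nearest-neighbor codebook lookup, the greedy decoding rule, and every name and value below (`quantize`, `greedy_decode`, `toy_scores`, the four-entry codebook) are assumptions chosen only to show the data flow.

```python
# Toy sketch only: all names, sizes, and values are illustrative assumptions,
# not the Parti or ViT-VQGAN implementation.

def quantize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
            for p in patches]

def greedy_decode(text_tokens, next_token_scores, num_image_tokens):
    """Produce image tokens one at a time, each conditioned on the text
    tokens and on all image tokens generated so far."""
    image_tokens = []
    for _ in range(num_image_tokens):
        scores = next_token_scores(text_tokens, image_tokens)
        # Greedy choice is an assumption made here for simplicity.
        image_tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return image_tokens

# Step 1 (tokenization): hypothetical 2-D codebook with four entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.9)]
tokens = quantize(patches, codebook)

# Step 2 (generation): a deterministic stand-in for a trained decoder.
def toy_scores(text_tokens, image_tokens):
    vocab = len(codebook)
    favored = (sum(text_tokens) + len(image_tokens)) % vocab
    return [1.0 if k == favored else 0.0 for k in range(vocab)]

generated = greedy_decode([3, 1], toy_scores, num_image_tokens=4)
print(tokens, generated)
```

In a real system the scoring function would be a trained encoder-decoder Transformer and the generated token sequence would be passed back through the tokenizer's decoder to obtain pixels; both of those pieces are omitted here.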

Comments:	Preprint
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2206.10789 [cs.CV]
	(or arXiv:2206.10789v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.10789

Submission history

From: Jiahui Yu
[v1] Wed, 22 Jun 2022 01:11:29 UTC (48,229 KB)
