Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Yu, Sihyun; Kwak, Sangkyung; Jang, Huiwon; Jeong, Jongheon; Huang, Jonathan; Shin, Jinwoo; Xie, Saining

arXiv:2410.06940 (cs)

[Submitted on 9 Oct 2024 (v1), last revised 5 Dec 2024 (this version, v2)]

Abstract: Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5×, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.
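The alignment regularizer described in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the form of the projection, or how the term is weighted, so the choices below are assumptions made for illustration only: a per-patch cosine similarity, a plain linear projection, and a hypothetical weight `lam`. All function names are hypothetical and do not come from the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def project(hidden, weight):
    """Linear projection of one hidden state.

    `weight` is a list of rows with shape (out_dim, in_dim).
    Assumption: a linear map stands in for whatever projection is used.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def alignment_loss(hidden_states, target_features, weight):
    """Negative mean cosine similarity over patches.

    hidden_states:   per-patch hidden states of the denoiser (noisy input).
    target_features: per-patch features of the clean image from a
                     pretrained encoder (treated as fixed targets).
    """
    sims = [
        cosine_similarity(project(h, weight), y)
        for h, y in zip(hidden_states, target_features)
    ]
    return -sum(sims) / len(sims)


def regularized_loss(denoising_loss, align, lam=0.5):
    """Denoising objective plus the weighted alignment term.

    `lam` is a hypothetical weighting coefficient, not a value from the paper.
    """
    return denoising_loss + lam * align


# Toy usage: two patches, 2-dimensional states, identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
targets = [[2.0, 0.0], [0.0, 1.0]]
align = alignment_loss(hidden, targets, identity)
total = regularized_loss(denoising_loss=1.0, align=align)
```

In this toy case the projected hidden states point in the same directions as the targets, so the alignment term is close to its minimum of -1. In an actual training setup the projection would be learned jointly with the denoiser while the external encoder stays frozen.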

Comments:	Preprint. Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2410.06940 [cs.CV]
	(or arXiv:2410.06940v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.06940

Submission history

From: Sihyun Yu
[v1] Wed, 9 Oct 2024 14:34:53 UTC (24,080 KB)
[v2] Thu, 5 Dec 2024 07:39:22 UTC (37,197 KB)
