Scaling Diffusion Transformers to 16 Billion Parameters

Fei, Zhengcong; Fan, Mingyuan; Yu, Changqian; Li, Debang; Huang, Junshi

arXiv:2407.11633 (cs)

[Submitted on 16 Jul 2024 (v1), last revised 9 Sep 2024 (this version, v3)]


Abstract: In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while offering highly optimized inference. DiT-MoE includes two simple designs, shared expert routing and an expert-level balance loss, which capture common knowledge and reduce redundancy among the routed experts. When applied to conditional image generation, a detailed analysis of expert specialization yields several observations: (i) expert selection shows a preference for spatial position and denoising time step, while being insensitive to class-conditional information; (ii) as the MoE layers go deeper, expert selection gradually shifts from specific spatial positions toward dispersion and balance; (iii) expert specialization tends to be more concentrated at early time steps and gradually becomes uniform after the halfway point. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Guided by these observations, a series of DiT-MoE models experimentally achieves performance on par with dense networks while requiring much less computation during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512$\times$512 resolution. The project page: this https URL.
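
The two designs named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the balance loss below assumes a common auxiliary-loss form (number of experts times the sum over experts of routed fraction times mean gate probability), and every name (`route_tokens`, `balance_loss`, `moe_layer`) is hypothetical. Tokens are plain floats and experts are plain functions to keep the example self-contained.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_tokens(logits, top_k):
    """Return gate probabilities and the top-k expert indices for each token."""
    probs = [softmax(row) for row in logits]
    choices = [
        sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:top_k]
        for p in probs
    ]
    return probs, choices

def balance_loss(probs, choices, top_k):
    """Assumed auxiliary loss: n_experts * sum_i f_i * p_i.

    f_i is the fraction of routing slots assigned to expert i and
    p_i is the mean gate probability of expert i over all tokens.
    """
    n_tokens, n_experts = len(probs), len(probs[0])
    f = [0.0] * n_experts
    for top in choices:
        for i in top:
            f[i] += 1.0 / (n_tokens * top_k)
    p = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

def moe_layer(tokens, logits, routed_experts, shared_expert, top_k):
    """Apply the shared expert to every token, then add the gated top-k routed experts."""
    probs, choices = route_tokens(logits, top_k)
    outputs = []
    for x, p, top in zip(tokens, probs, choices):
        y = shared_expert(x)  # shared expert processes every token
        for i in top:
            y += p[i] * routed_experts[i](x)  # sparse, gate-weighted contribution
        outputs.append(y)
    return outputs, balance_loss(probs, choices, top_k)
```

Under this assumed loss, routing that concentrates all tokens on one expert is penalized more heavily than routing with uniform gate probabilities, which is the behavior an expert-level balance term is meant to encourage.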

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.11633 [cs.CV]
	(or arXiv:2407.11633v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.11633

Submission history

From: Zhengcong Fei [view email]
[v1] Tue, 16 Jul 2024 11:55:23 UTC (10,645 KB)
[v2] Fri, 6 Sep 2024 06:00:42 UTC (10,943 KB)
[v3] Mon, 9 Sep 2024 02:17:12 UTC (10,943 KB)
