DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

Teng, Yao; Wu, Yue; Shi, Han; Ning, Xuefei; Dai, Guohao; Wang, Yu; Li, Zhenguo; Liu, Xihui

arXiv:2405.14224 (cs)

[Submitted on 23 May 2024 (v1), last revised 10 Jul 2024 (this version, v2)]

Abstract: Diffusion models have achieved great success in image generation, with the backbone evolving from U-Net to Vision Transformers. However, the computational cost of Transformers is quadratic in the number of tokens, which poses significant challenges for high-resolution images. In this work, we propose Diffusion Mamba (DiM), which combines the efficiency of Mamba, a sequence model based on State Space Models (SSMs), with the expressive power of diffusion models for efficient high-resolution image synthesis. To address the challenge that Mamba cannot generalize to 2D signals, we introduce several architectural designs, including multi-directional scans, learnable padding tokens at the end of each row and column, and lightweight local feature enhancement. Our DiM architecture achieves inference-time efficiency for high-resolution images. In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate a "weak-to-strong" training strategy that pretrains DiM on low-resolution images ($256\times 256$) and then fine-tunes it on high-resolution images ($512\times 512$). We further explore training-free upsampling strategies that enable the model to generate higher-resolution images (e.g., $1024\times 1024$ and $1536\times 1536$) without further fine-tuning. Experiments demonstrate the effectiveness and efficiency of DiM. The code of our work is available here: this https URL.
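
The abstract mentions multi-directional scans with padding tokens at the end of each row and column. The sketch below is only an illustration of that idea, not the authors' implementation: it flattens a 2D token grid into 1D sequences in four orders and marks every row or column boundary with a padding token. The choice of exactly four directions, the names `PAD_ROW`, `PAD_COL`, `row_scan`, `col_scan`, and `multi_directional_scans`, and the use of strings in place of learnable embedding vectors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Token = str  # stand-in for an embedding vector; a real model would use tensors

PAD_ROW = "<row_pad>"  # hypothetical marker for the end-of-row padding token
PAD_COL = "<col_pad>"  # hypothetical marker for the end-of-column padding token


def row_scan(grid: Sequence[Sequence[Token]]) -> List[Token]:
    """Flatten the grid row by row, appending a padding token after each row."""
    seq: List[Token] = []
    for row in grid:
        seq.extend(row)
        seq.append(PAD_ROW)
    return seq


def col_scan(grid: Sequence[Sequence[Token]]) -> List[Token]:
    """Flatten the grid column by column, appending a padding token after each column."""
    seq: List[Token] = []
    for col in zip(*grid):  # zip(*grid) iterates over columns
        seq.extend(col)
        seq.append(PAD_COL)
    return seq


def multi_directional_scans(grid: Sequence[Sequence[Token]]) -> Dict[str, List[Token]]:
    """Return four 1D orderings of a 2D token grid (an assumed set of directions)."""
    rows = row_scan(grid)
    cols = col_scan(grid)
    return {
        "row_forward": rows,
        "row_backward": rows[::-1],  # simple reversal, so padding precedes each row
        "col_forward": cols,
        "col_backward": cols[::-1],  # simple reversal, so padding precedes each column
    }


if __name__ == "__main__":
    grid = [["a", "b", "c"], ["d", "e", "f"]]  # a 2 x 3 grid of patch tokens
    for name, seq in multi_directional_scans(grid).items():
        print(name, seq)
```

In this toy version the padding token signals to a 1D sequence model where one row or column ends and the next begins, so that tokens adjacent in the flattened sequence but far apart in the image can be told apart.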

Comments:	The code of our work is available here: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.14224 [cs.CV]
	(or arXiv:2405.14224v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.14224

Submission history

From: Yao Teng
[v1] Thu, 23 May 2024 06:53:18 UTC (8,579 KB)
[v2] Wed, 10 Jul 2024 09:02:11 UTC (8,579 KB)