⚡️Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

💡 Introduction

We introduce Sana, a text-to-image framework that can efficiently generate images at resolutions up to 4096 × 4096. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at remarkably fast speed, and can be deployed on a laptop GPU. Core designs include:

(1) DC-AE: unlike traditional AEs, which compress images by only 8×, we trained an AE that compresses images by 32×, effectively reducing the number of latent tokens.
(2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
(3) Decoder-only text encoder: we replaced T5 with a modern, small decoder-only LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment.
(4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
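
To make designs (1) and (2) concrete, here is a minimal sketch in plain Python. It is not Sana's implementation. The first helper counts latent tokens for a given AE compression factor; its `patch_size` parameter is a hypothetical knob added for illustration. The second part shows the general idea behind linear attention: with a non-negative feature map (ReLU is assumed here, since the text above only says "linear attention"), the key-value product can be accumulated once and reused for every query, so cost grows linearly with the number of tokens instead of quadratically. The function names are illustrative only.

```python
def latent_tokens(image_size, compression, patch_size=1):
    """Number of latent tokens for a square image.

    `compression` is the AE's spatial downsampling factor (8 or 32 above).
    `patch_size` is a hypothetical extra patchify factor, default 1.
    """
    side = image_size // (compression * patch_size)
    return side * side


def relu(x):
    return x if x > 0 else 0.0


def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with an assumed ReLU feature map.

    Accumulates sum_n relu(k_n) v_n^T (a d x d_v matrix) and sum_n relu(k_n)
    once, then reuses both for every query: O(N * d * d_v) overall.
    """
    d, dv = len(Q[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]
    k_sum = [0.0] * d
    for k, v in zip(K, V):
        fk = [relu(x) for x in k]
        for i in range(d):
            k_sum[i] += fk[i]
            for j in range(dv):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = [relu(x) for x in q]
        denom = sum(fq[i] * k_sum[i] for i in range(d)) + eps
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out


def quadratic_reference(Q, K, V, eps=1e-6):
    """Same result computed with an explicit N x N weight matrix: O(N^2)."""
    dv = len(V[0])
    out = []
    for q in Q:
        fq = [relu(x) for x in q]
        w = [sum(a * relu(b) for a, b in zip(fq, k)) for k in K]
        denom = sum(w) + eps
        out.append([sum(wn * v[j] for wn, v in zip(w, V)) / denom
                    for j in range(dv)])
    return out


if __name__ == "__main__":
    # 1024 x 1024 image: 8x AE gives a 128 x 128 grid, 32x AE gives 32 x 32.
    print(latent_tokens(1024, 8), latent_tokens(1024, 32))
    Q = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
    K = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(linear_attention(Q, K, V))
```

Under these assumptions, moving from 8× to 32× compression cuts the token count by a factor of 16 at any resolution, and the two attention functions return the same values while only the second one builds per-query weights over all tokens.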

As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), while being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 image. Sana enables content creation at low cost.

🔥🔥 News

Sana code is coming soon
(🔥 New) [2024/10] Demo is released.
(🔥 New) [2024/10] DC-AE Code and weights are released!
[2024/10] Paper is on arXiv!

Performance

Methods	Throughput (samples/s)	Latency (s)	Params (B)	Speedup	FID 👇	CLIP 👆	GenEval 👆	DPG 👆
512 × 512 resolution
PixArt-α	1.5	1.2	0.6	1.0×	6.14	27.55	0.48	71.6
PixArt-Σ	1.5	1.2	0.6	1.0×	6.34	27.62	0.52	79.5
Sana-0.6B	6.7	0.8	0.6	5.0×	5.67	27.92	0.64	84.3
Sana-1.6B	3.8	0.6	1.6	2.5×	5.16	28.19	0.66	85.5
1024 × 1024 resolution
LUMINA-Next	0.12	9.1	2.0	2.8×	7.58	26.84	0.46	74.6
SDXL	0.15	6.5	2.6	3.5×	6.63	29.03	0.55	74.7
PlayGroundv2.5	0.21	5.3	2.6	4.9×	6.09	29.13	0.56	75.5
Hunyuan-DiT	0.05	18.2	1.5	1.2×	6.54	28.19	0.63	78.9
PixArt-Σ	0.4	2.7	0.6	9.3×	6.15	28.26	0.54	80.5
DALLE3	-	-	-	-	-	-	0.67	83.5
SD3-medium	0.28	4.4	2.0	6.5×	11.92	27.83	0.62	84.1
FLUX-dev	0.04	23.0	12.0	1.0×	10.15	27.47	0.67	84.0
FLUX-schnell	0.5	2.1	12.0	11.6×	7.94	28.14	0.71	84.8
Sana-0.6B	1.7	0.9	0.6	39.5×	5.81	28.36	0.64	83.6
Sana-1.6B	1.0	1.2	1.6	23.3×	5.76	28.67	0.66	84.8
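
The Speedup column appears to be throughput relative to the row marked 1.0× in each block (PixArt-α at 512 × 512, FLUX-dev at 1024 × 1024). That reading is inferred from the numbers, not stated in the table. The sketch below recomputes the ratios for the Sana rows at 1024 × 1024 from the rounded values shown above; the dictionary layout and function names are illustrative only.

```python
# Values copied from the 1024 x 1024 rows of the table above.
ROWS_1024 = {
    "FLUX-dev":  {"throughput": 0.04, "params_b": 12.0, "listed_speedup": 1.0},
    "Sana-0.6B": {"throughput": 1.7,  "params_b": 0.6,  "listed_speedup": 39.5},
    "Sana-1.6B": {"throughput": 1.0,  "params_b": 1.6,  "listed_speedup": 23.3},
}


def relative_speedup(rows, model, baseline="FLUX-dev"):
    """Throughput of `model` divided by throughput of `baseline`."""
    return rows[model]["throughput"] / rows[baseline]["throughput"]


def size_ratio(rows, model, baseline="FLUX-dev"):
    """How many times fewer parameters `model` has than `baseline`."""
    return rows[baseline]["params_b"] / rows[model]["params_b"]


if __name__ == "__main__":
    for name in ("Sana-0.6B", "Sana-1.6B"):
        print(
            name,
            "recomputed speedup: %.1f" % relative_speedup(ROWS_1024, name),
            "listed: %.1f" % ROWS_1024[name]["listed_speedup"],
            "size ratio: %.1f" % size_ratio(ROWS_1024, name),
        )
```

The recomputed speedups come out slightly higher than the listed ones, presumably because the listed values were derived from unrounded throughput measurements. The size ratio for Sana-0.6B matches the "20 times smaller" figure in the introduction.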

💪To-Do List

We will try our best to release

🤗Acknowledgements

Thanks to PixArt-α, PixArt-Σ and Efficient-ViT for their wonderful work and codebase!

📖BibTeX

@misc{xie2024sana,
      title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
      author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
      year={2024},
      eprint={2410.10629},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.10629},
    }
