
SanaPipeline

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.

The abstract from the paper is:

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.

Tip

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by lawrence-cj and chenjy2003. The original codebase can be found here. The original weights can be found under hf.co/Efficient-Large-Model.

Available models:

| Model | Recommended dtype |
|:---|:---|
| `Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers` | `mindspore.bfloat16` |
| `Efficient-Large-Model/Sana_1600M_1024px_diffusers` | `mindspore.float16` |
| `Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers` | `mindspore.float16` |
| `Efficient-Large-Model/Sana_1600M_512px_diffusers` | `mindspore.float16` |
| `Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers` | `mindspore.float16` |
| `Efficient-Large-Model/Sana_600M_1024px_diffusers` | `mindspore.float16` |
| `Efficient-Large-Model/Sana_600M_512px_diffusers` | `mindspore.float16` |

Refer to this collection for more information.

Note: The recommended dtypes listed above are for the transformer weights. The text encoder and VAE weights must stay in `mindspore.bfloat16` or `mindspore.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

Tip

Make sure to pass the `variant` argument when downloading checkpoints to use less disk space. Set it to `"fp16"` for models whose recommended dtype is `mindspore.float16`, and `"bf16"` for models whose recommended dtype is `mindspore.bfloat16`. By default, `mindspore.float32` weights are downloaded, which use twice the disk storage. Additionally, `mindspore.float32` weights can be downcast on the fly by specifying the `mindspore_dtype` argument. Read about it in the docs.
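
A minimal inference sketch (assuming the `mindone.diffusers` API mirrors `diffusers`; the checkpoint name comes from the table above). It loads the transformer in the recommended `mindspore.bfloat16` while keeping the text encoder and VAE in `mindspore.bfloat16`, as the note above requires:

```python
import mindspore as ms
from mindone.diffusers import SanaPipeline

# Load the bf16 variant to save disk space; `mindspore_dtype` sets the transformer dtype.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",
    mindspore_dtype=ms.bfloat16,
)
# Keep the text encoder and VAE in bfloat16 (or float32), per the note above.
pipe.text_encoder.to(ms.bfloat16)
pipe.vae.to(ms.bfloat16)

prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
# With the default `return_dict=False`, the pipeline returns a tuple whose first
# element is the list of generated images.
image = pipe(prompt=prompt)[0][0]
image.save("sana.png")
```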

mindone.diffusers.SanaPipeline

Bases: DiffusionPipeline, SanaLoraLoaderMixin

Pipeline for text-to-image generation using Sana.

Source code in mindone/diffusers/pipelines/sana/pipeline_sana.py
class SanaPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
    r"""
    Pipeline for text-to-image generation using [Sana](https://huggingface.co/papers/2410.10629).
    """

    # fmt: off
    bad_punct_regex = re.compile(r"[" + "#®•©™&@·º½¾¿¡§~" + r"\)" + r"\(" + r"\]" + r"\[" + r"\}" + r"\{" + r"\|" + "\\" + r"\/" + r"\*" + r"]{1,}")
    # fmt: on

    model_cpu_offload_seq = "text_encoder->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]

    def __init__(
        self,
        tokenizer: AutoTokenizer,
        text_encoder: MSPreTrainedModel,
        vae: AutoencoderDC,
        transformer: SanaTransformer2DModel,
        scheduler: DPMSolverMultistepScheduler,
    ):
        super().__init__()

        self.register_modules(
            tokenizer=tokenizer, text_encoder=text_encoder, vae=vae, transformer=transformer, scheduler=scheduler
        )

        self.vae_scale_factor = (
            2 ** (len(self.vae.config.encoder_block_out_channels) - 1)
            if hasattr(self, "vae") and self.vae is not None
            else 32
        )
        self.image_processor = PixArtImageProcessor(vae_scale_factor=self.vae_scale_factor)

    def encode_prompt(
        self,
        prompt: Union[str, List[str]],
        do_classifier_free_guidance: bool = True,
        negative_prompt: str = "",
        num_images_per_prompt: int = 1,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        prompt_attention_mask: Optional[ms.Tensor] = None,
        negative_prompt_attention_mask: Optional[ms.Tensor] = None,
        clean_caption: bool = False,
        max_sequence_length: int = 300,
        complex_human_instruction: Optional[List[str]] = None,
        lora_scale: Optional[float] = None,
    ):
        r"""
        Encodes the prompt into text encoder hidden states.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                prompt to be encoded
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds`
                instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For
                Sana, this should be "".
            do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
                whether to use classifier free guidance or not
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                number of images that should be generated per prompt
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. For Sana, this should be the embeddings of the "" string.
            clean_caption (`bool`, defaults to `False`):
                If `True`, the function will preprocess and clean the provided caption before encoding.
            max_sequence_length (`int`, defaults to 300): Maximum sequence length to use for the prompt.
            complex_human_instruction (`List[str]`, *optional*):
                If `complex_human_instruction` is not empty, the complex human instruction is prepended to the
                prompt before encoding.
        """
        # set lora scale so that monkey patched LoRA
        # function of text encoder can correctly access it
        if lora_scale is not None and isinstance(self, SanaLoraLoaderMixin):
            self._lora_scale = lora_scale

            # dynamically adjust the LoRA scale
            if self.text_encoder is not None:
                scale_lora_layers(self.text_encoder, lora_scale)

        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        self.tokenizer.padding_side = "right"

        # See Section 3.1. of the paper: keep the BOS token plus the last `max_length - 1` tokens,
        # so the tail of the user prompt is preserved even when a long instruction prefix is prepended.
        max_length = max_sequence_length
        select_index = [0] + list(range(-max_length + 1, 0))

        if prompt_embeds is None:
            prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)

            # prepare complex human instruction
            if not complex_human_instruction:
                max_length_all = max_length
            else:
                chi_prompt = "\n".join(complex_human_instruction)
                prompt = [chi_prompt + p for p in prompt]
                num_chi_prompt_tokens = len(self.tokenizer.encode(chi_prompt))
                max_length_all = num_chi_prompt_tokens + max_length - 2

            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=max_length_all,
                truncation=True,
                add_special_tokens=True,
                return_tensors="np",
            )
            text_input_ids = ms.Tensor.from_numpy(text_inputs.input_ids)

            prompt_attention_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)

            prompt_embeds = self.text_encoder(text_input_ids, attention_mask=prompt_attention_mask)
            prompt_embeds = prompt_embeds[0][:, select_index]
            prompt_attention_mask = prompt_attention_mask[:, select_index]

        if self.transformer is not None:
            dtype = self.transformer.dtype
        elif self.text_encoder is not None:
            dtype = self.text_encoder.dtype
        else:
            dtype = None

        prompt_embeds = prompt_embeds.to(dtype=dtype)

        bs_embed, seq_len, _ = prompt_embeds.shape
        # duplicate text embeddings and attention mask for each generation per prompt
        prompt_embeds = prompt_embeds.tile((1, num_images_per_prompt, 1))
        prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
        prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1)
        prompt_attention_mask = prompt_attention_mask.tile((num_images_per_prompt, 1))

        # get unconditional embeddings for classifier free guidance
        if do_classifier_free_guidance and negative_prompt_embeds is None:
            uncond_tokens = [negative_prompt] * batch_size if isinstance(negative_prompt, str) else negative_prompt
            uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
            max_length = prompt_embeds.shape[1]
            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_attention_mask=True,
                add_special_tokens=True,
                return_tensors="np",
            )
            negative_prompt_attention_mask = ms.Tensor.from_numpy(uncond_input.attention_mask)

            negative_prompt_embeds = self.text_encoder(
                ms.Tensor.from_numpy(uncond_input.input_ids),
                attention_mask=negative_prompt_attention_mask,
            )
            negative_prompt_embeds = negative_prompt_embeds[0]

        if do_classifier_free_guidance:
            # duplicate unconditional embeddings for each generation per prompt
            seq_len = negative_prompt_embeds.shape[1]

            negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype)

            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

            negative_prompt_attention_mask = negative_prompt_attention_mask.view(bs_embed, -1)
            negative_prompt_attention_mask = negative_prompt_attention_mask.tile((num_images_per_prompt, 1))
        else:
            negative_prompt_embeds = None
            negative_prompt_attention_mask = None

        if self.text_encoder is not None:
            if isinstance(self, SanaLoraLoaderMixin):
                # Retrieve the original scale by scaling back the LoRA layers
                unscale_lora_layers(self.text_encoder, lora_scale)

        return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, generator, eta):
        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]

        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check if the scheduler accepts generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

    def check_inputs(
        self,
        prompt,
        height,
        width,
        callback_on_step_end_tensor_inputs=None,
        negative_prompt=None,
        prompt_embeds=None,
        negative_prompt_embeds=None,
        prompt_attention_mask=None,
        negative_prompt_attention_mask=None,
    ):
        if height % 32 != 0 or width % 32 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"  # noqa: E501
            )

        if prompt is not None and prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                " only forward one of the two."
            )
        elif prompt is None and prompt_embeds is None:
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if prompt_embeds is not None and prompt_attention_mask is None:
            raise ValueError("Must provide `prompt_attention_mask` when specifying `prompt_embeds`.")

        if negative_prompt_embeds is not None and negative_prompt_attention_mask is None:
            raise ValueError("Must provide `negative_prompt_attention_mask` when specifying `negative_prompt_embeds`.")

        if prompt_embeds is not None and negative_prompt_embeds is not None:
            if prompt_embeds.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )
            if prompt_attention_mask.shape != negative_prompt_attention_mask.shape:
                raise ValueError(
                    "`prompt_attention_mask` and `negative_prompt_attention_mask` must have the same shape when passed directly, but"
                    f" got: `prompt_attention_mask` {prompt_attention_mask.shape} != `negative_prompt_attention_mask`"
                    f" {negative_prompt_attention_mask.shape}."
                )

    # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._text_preprocessing
    def _text_preprocessing(self, text, clean_caption=False):
        if clean_caption and not is_bs4_available():
            logger.warning(BACKENDS_MAPPING["bs4"][-1].format("Setting `clean_caption=True`"))
            logger.warning("Setting `clean_caption` to False...")
            clean_caption = False

        if clean_caption and not is_ftfy_available():
            logger.warning(BACKENDS_MAPPING["ftfy"][-1].format("Setting `clean_caption=True`"))
            logger.warning("Setting `clean_caption` to False...")
            clean_caption = False

        if not isinstance(text, (tuple, list)):
            text = [text]

        def process(text: str):
            if clean_caption:
                text = self._clean_caption(text)
                text = self._clean_caption(text)
            else:
                text = text.lower().strip()
            return text

        return [process(t) for t in text]

    # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._clean_caption
    def _clean_caption(self, caption):
        caption = str(caption)
        caption = ul.unquote_plus(caption)
        caption = caption.strip().lower()
        caption = re.sub("<person>", "person", caption)
        # urls:
        caption = re.sub(
            r"\b((?:https?:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
            "",
            caption,
        )  # regex for urls
        caption = re.sub(
            r"\b((?:www:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
            "",
            caption,
        )  # regex for urls
        # html:
        caption = BeautifulSoup(caption, features="html.parser").text

        # @<nickname>
        caption = re.sub(r"@[\w\d]+\b", "", caption)

        # 31C0—31EF CJK Strokes
        # 31F0—31FF Katakana Phonetic Extensions
        # 3200—32FF Enclosed CJK Letters and Months
        # 3300—33FF CJK Compatibility
        # 3400—4DBF CJK Unified Ideographs Extension A
        # 4DC0—4DFF Yijing Hexagram Symbols
        # 4E00—9FFF CJK Unified Ideographs
        caption = re.sub(r"[\u31c0-\u31ef]+", "", caption)
        caption = re.sub(r"[\u31f0-\u31ff]+", "", caption)
        caption = re.sub(r"[\u3200-\u32ff]+", "", caption)
        caption = re.sub(r"[\u3300-\u33ff]+", "", caption)
        caption = re.sub(r"[\u3400-\u4dbf]+", "", caption)
        caption = re.sub(r"[\u4dc0-\u4dff]+", "", caption)
        caption = re.sub(r"[\u4e00-\u9fff]+", "", caption)
        #######################################################

        # все виды тире / all types of dash --> "-"
        caption = re.sub(
            r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]+",  # noqa
            "-",
            caption,
        )

        # кавычки к одному стандарту
        caption = re.sub(r"[`´«»“”¨]", '"', caption)
        caption = re.sub(r"[‘’]", "'", caption)

        # &quot;
        caption = re.sub(r"&quot;?", "", caption)
        # &amp
        caption = re.sub(r"&amp", "", caption)

        # ip adresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
        caption = re.sub(r"\d:\d\d\s+$", "", caption)

        # \n
        caption = re.sub(r"\\n", " ", caption)

        # "#123"
        caption = re.sub(r"#\d{1,3}\b", "", caption)
        # "#12345.."
        caption = re.sub(r"#\d{5,}\b", "", caption)
        # "123456.."
        caption = re.sub(r"\b\d{6,}\b", "", caption)
        # filenames:
        caption = re.sub(r"[\S]+\.(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)", "", caption)

        #
        caption = re.sub(r"[\"\']{2,}", r'"', caption)  # """AUSVERKAUFT"""
        caption = re.sub(r"[\.]{2,}", r" ", caption)  # """AUSVERKAUFT"""

        caption = re.sub(self.bad_punct_regex, r" ", caption)  # ***AUSVERKAUFT***, #AUSVERKAUFT
        caption = re.sub(r"\s+\.\s+", r" ", caption)  # " . "

        # this-is-my-cute-cat / this_is_my_cute_cat
        regex2 = re.compile(r"(?:\-|\_)")
        if len(re.findall(regex2, caption)) > 3:
            caption = re.sub(regex2, " ", caption)

        caption = ftfy.fix_text(caption)
        caption = html.unescape(html.unescape(caption))

        caption = re.sub(r"\b[a-zA-Z]{1,3}\d{3,15}\b", "", caption)  # jc6640
        caption = re.sub(r"\b[a-zA-Z]+\d+[a-zA-Z]+\b", "", caption)  # jc6640vc
        caption = re.sub(r"\b\d+[a-zA-Z]+\d+\b", "", caption)  # 6640vc231

        caption = re.sub(r"(worldwide\s+)?(free\s+)?shipping", "", caption)
        caption = re.sub(r"(free\s)?download(\sfree)?", "", caption)
        caption = re.sub(r"\bclick\b\s(?:for|on)\s\w+", "", caption)
        caption = re.sub(r"\b(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)(\simage[s]?)?", "", caption)
        caption = re.sub(r"\bpage\s+\d+\b", "", caption)

        caption = re.sub(r"\b\d*[a-zA-Z]+\d+[a-zA-Z]+\d+[a-zA-Z\d]*\b", r" ", caption)  # j2d1a2a...

        caption = re.sub(r"\b\d+\.?\d*[xх×]\d+\.?\d*\b", "", caption)

        caption = re.sub(r"\b\s+\:\s+", r": ", caption)
        caption = re.sub(r"(\D[,\./])\b", r"\1 ", caption)
        caption = re.sub(r"\s+", " ", caption)

        caption = caption.strip()

        caption = re.sub(r"^[\"\']([\w\W]+)[\"\']$", r"\1", caption)
        caption = re.sub(r"^[\'\_,\-\:;]", r"", caption)
        caption = re.sub(r"[\'\_,\-\:\-\+]$", r"", caption)
        caption = re.sub(r"^\.\S+$", "", caption)

        return caption.strip()

    def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, generator, latents=None):
        if latents is not None:
            return latents.to(dtype=dtype)

        shape = (
            batch_size,
            num_channels_latents,
            int(height) // self.vae_scale_factor,
            int(width) // self.vae_scale_factor,
        )
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        latents = randn_tensor(shape, generator=generator, dtype=dtype)
        return latents

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def attention_kwargs(self):
        return self._attention_kwargs

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1.0

    @property
    def num_timesteps(self):
        return self._num_timesteps

    @property
    def interrupt(self):
        return self._interrupt

    def __call__(
        self,
        prompt: Union[str, List[str]] = None,
        negative_prompt: str = "",
        num_inference_steps: int = 20,
        timesteps: List[int] = None,
        sigmas: List[float] = None,
        guidance_scale: float = 4.5,
        num_images_per_prompt: Optional[int] = 1,
        height: int = 1024,
        width: int = 1024,
        eta: float = 0.0,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        prompt_attention_mask: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_attention_mask: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        clean_caption: bool = True,
        use_resolution_binning: bool = True,
        attention_kwargs: Optional[Dict[str, Any]] = None,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        max_sequence_length: int = 300,
        complex_human_instruction: List[str] = [
            "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",  # noqa: E501
            "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",  # noqa: E501
            "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
            "Here are examples of how to transform or refine prompts:",
            "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",  # noqa: E501
            "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",  # noqa: E501
            "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
            "User Prompt: ",
        ],
    ) -> Union[SanaPipelineOutput, Tuple]:
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
                instead.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                less than `1`).
            num_inference_steps (`int`, *optional*, defaults to 20):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[int]`, *optional*):
                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
                passed will be used. Must be in descending order.
            sigmas (`List[float]`, *optional*):
                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 4.5):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
                usually at the expense of lower image quality.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            height (`int`, *optional*, defaults to `1024`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `1024`):
                The width in pixels of the generated image.
            eta (`float`, *optional*, defaults to 0.0):
                Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
                [`schedulers.DDIMScheduler`], will be ignored for others.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [numpy generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            prompt_attention_mask (`ms.Tensor`, *optional*): Pre-generated attention mask for text embeddings.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. For Sana, this negative prompt should be "". If not
                provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.
            negative_prompt_attention_mask (`ms.Tensor`, *optional*):
                Pre-generated attention mask for negative text embeddings.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between
                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] instead of a plain tuple.
            attention_kwargs:
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            clean_caption (`bool`, *optional*, defaults to `True`):
                Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to
                be installed. If the dependencies are not installed, the embeddings will be created from the raw
                prompt.
            use_resolution_binning (`bool`, defaults to `True`):
                If set to `True`, the requested height and width are first mapped to the closest resolutions using
                `ASPECT_RATIO_1024_BIN`. After the produced latents are decoded into images, they are resized back to
                the requested resolution. Useful for generating non-square images.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.
            max_sequence_length (`int`, defaults to `300`):
                Maximum sequence length to use with the `prompt`.
            complex_human_instruction (`List[str]`, *optional*):
                Instructions for the complex human instruction prompt-enhancement step:
                https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.

        Examples:
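            ```py
            >>> import mindspore as ms
            >>> from mindone.diffusers import SanaPipeline

            >>> # a minimal sketch; the checkpoint name comes from the model table above
            >>> pipe = SanaPipeline.from_pretrained(
            ...     "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
            ...     variant="bf16",
            ...     mindspore_dtype=ms.bfloat16,
            ... )
            >>> image = pipe(prompt="a tiny astronaut hatching from an egg on the moon")[0][0]
            >>> image.save("sana.png")
            ```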

        Returns:
            [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] is returned,
                otherwise a `tuple` is returned where the first element is a list with the generated images.
        """

        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

        # 1. Check inputs. Raise error if not correct
        if use_resolution_binning:
            if self.transformer.config.sample_size == 64:
                aspect_ratio_bin = ASPECT_RATIO_2048_BIN
            elif self.transformer.config.sample_size == 32:
                aspect_ratio_bin = ASPECT_RATIO_1024_BIN
            elif self.transformer.config.sample_size == 16:
                aspect_ratio_bin = ASPECT_RATIO_512_BIN
            else:
                raise ValueError("Invalid sample size")
            orig_height, orig_width = height, width
            height, width = self.image_processor.classify_height_width_bin(height, width, ratios=aspect_ratio_bin)

        self.check_inputs(
            prompt,
            height,
            width,
            callback_on_step_end_tensor_inputs,
            negative_prompt,
            prompt_embeds,
            negative_prompt_embeds,
            prompt_attention_mask,
            negative_prompt_attention_mask,
        )

        self._guidance_scale = guidance_scale
        self._attention_kwargs = attention_kwargs
        self._interrupt = False

        # 2. Define call parameters
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None

        # 3. Encode input prompt
        (
            prompt_embeds,
            prompt_attention_mask,
            negative_prompt_embeds,
            negative_prompt_attention_mask,
        ) = self.encode_prompt(
            prompt,
            self.do_classifier_free_guidance,
            negative_prompt=negative_prompt,
            num_images_per_prompt=num_images_per_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
            clean_caption=clean_caption,
            max_sequence_length=max_sequence_length,
            complex_human_instruction=complex_human_instruction,
            lora_scale=lora_scale,
        )
        if self.do_classifier_free_guidance:
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds], axis=0)
            prompt_attention_mask = ops.cat([negative_prompt_attention_mask, prompt_attention_mask], axis=0)

        # 4. Prepare timesteps
        timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps, sigmas)

        # 5. Prepare latents.
        latent_channels = self.transformer.config.in_channels
        latents = self.prepare_latents(
            batch_size * num_images_per_prompt,
            latent_channels,
            height,
            width,
            ms.float32,
            generator,
            latents,
        )

        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

        # 7.0 lora_scale outside transformer
        if lora_scale is not None:
            # weight the lora layers by setting `lora_scale` for each PEFT layer
            scale_lora_layers(self.transformer, lora_scale)

        # 7. Denoising loop
        num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
        self._num_timesteps = len(timesteps)

        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                if self.interrupt:
                    continue

                latent_model_input = ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents
                latent_model_input = latent_model_input.to(prompt_embeds.dtype)

                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
                timestep = t.broadcast_to((latent_model_input.shape[0],)).to(latents.dtype)

                # predict noise model_output
                noise_pred = self.transformer(
                    latent_model_input,
                    encoder_hidden_states=prompt_embeds,
                    encoder_attention_mask=prompt_attention_mask,
                    timestep=timestep,
                    return_dict=False,
                    attention_kwargs=self.attention_kwargs,
                )[0]
                noise_pred = noise_pred.float()

                # perform guidance
                if self.do_classifier_free_guidance:
                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

                # learned sigma: keep only the noise-prediction half of the output channels
                if self.transformer.config.out_channels // 2 == latent_channels:
                    noise_pred = noise_pred.chunk(2, axis=1)[0]

                # compute previous image: x_t -> x_t-1
                latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        if lora_scale is not None:
            # remove `lora_scale` from each PEFT layer
            unscale_lora_layers(self.transformer, lora_scale)

        if output_type == "latent":
            image = latents
        else:
            latents = latents.to(self.vae.dtype)
            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
            if use_resolution_binning:
                image = self.image_processor.resize_and_crop_tensor(image, orig_width, orig_height)

        if output_type != "latent":
            image = self.image_processor.postprocess(image, output_type=output_type)

        if not return_dict:
            return (image,)

        return SanaPipelineOutput(images=image)
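
`encode_prompt` can also be called on its own to precompute embeddings once and reuse them across several generations. Below is a hedged sketch, not the library's documented recipe: it assumes the `pipe` object from the loading example above, and note that standalone `encode_prompt` defaults differ from `__call__` (no caption cleaning and no complex human instruction).

```python
import numpy as np

# Precompute prompt/negative embeddings once (classifier-free guidance is on by default).
(
    prompt_embeds,
    prompt_attention_mask,
    negative_prompt_embeds,
    negative_prompt_attention_mask,
) = pipe.encode_prompt("an origami crane on a wooden desk")

# Reuse the embeddings across seeds; `prompt` and `negative_prompt` must be None
# when the embeddings are passed in directly.
for seed in (0, 1, 2):
    image = pipe(
        prompt=None,
        negative_prompt=None,
        prompt_embeds=prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_embeds=negative_prompt_embeds,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        generator=np.random.default_rng(seed),
    )[0][0]
    image.save(f"crane_{seed}.png")
```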

mindone.diffusers.SanaPipeline.__call__(prompt=None, negative_prompt='', num_inference_steps=20, timesteps=None, sigmas=None, guidance_scale=4.5, num_images_per_prompt=1, height=1024, width=1024, eta=0.0, generator=None, latents=None, prompt_embeds=None, prompt_attention_mask=None, negative_prompt_embeds=None, negative_prompt_attention_mask=None, output_type='pil', return_dict=False, clean_caption=True, use_resolution_binning=True, attention_kwargs=None, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], max_sequence_length=300, complex_human_instruction=["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '])

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION

- `prompt` (`str` or `List[str]`, *optional*, default: `None`): The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.

- `negative_prompt` (`str` or `List[str]`, *optional*, default: `''`): The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).

- `num_inference_steps` (`int`, *optional*, default: `20`): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

- `timesteps` (`List[int]`, *optional*, default: `None`): Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.

- `sigmas` (`List[float]`, *optional*, default: `None`): Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used.

- `guidance_scale` (`float`, *optional*, default: `4.5`): Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages images that are closely linked to the text `prompt`, usually at the expense of lower image quality.

- `num_images_per_prompt` (`int`, *optional*, default: `1`): The number of images to generate per prompt.

- `height` (`int`, *optional*, default: `1024`): The height in pixels of the generated image.

- `width` (`int`, *optional*, default: `1024`): The width in pixels of the generated image.

- `eta` (`float`, *optional*, default: `0.0`): Corresponds to parameter eta (η) in the [DDIM paper](https://arxiv.org/abs/2010.02502). Only applies to [`schedulers.DDIMScheduler`]; ignored for other schedulers.

- `generator` (`np.random.Generator` or `List[np.random.Generator]`, *optional*, default: `None`): One or a list of [numpy generator(s)](https://numpy.org/doc/stable/reference/random/generator.html) to make generation deterministic.

- `latents` (`ms.Tensor`, *optional*, default: `None`): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.

- `prompt_embeds` (`ms.Tensor`, *optional*, default: `None`): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.

- `prompt_attention_mask` (`ms.Tensor`, *optional*, default: `None`): Pre-generated attention mask for text embeddings.

- `negative_prompt_embeds` (`ms.Tensor`, *optional*, default: `None`): Pre-generated negative text embeddings. For Sana, this negative prompt should be `""`. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.

- `negative_prompt_attention_mask` (`ms.Tensor`, *optional*, default: `None`): Pre-generated attention mask for negative text embeddings.

- `output_type` (`str`, *optional*, default: `'pil'`): The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/) (`PIL.Image.Image`) or `np.array`.

- `return_dict` (`bool`, *optional*, default: `False`): Whether or not to return a [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] instead of a plain tuple.

- `attention_kwargs` (`Optional[Dict[str, Any]]`, default: `None`): A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

- `clean_caption` (`bool`, *optional*, default: `True`): Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.

- `use_resolution_binning` (`bool`, default: `True`): If set to `True`, the requested height and width are first mapped to the closest resolutions using `ASPECT_RATIO_1024_BIN`. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.

- `callback_on_step_end` (`Callable`, *optional*, default: `None`): A function that is called at the end of each denoising step during inference, with the arguments `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include the tensors specified by `callback_on_step_end_tensor_inputs`.

- `callback_on_step_end_tensor_inputs` (`List`, *optional*, default: `['latents']`): The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You can only include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.

- `max_sequence_length` (`int`, default: `300`): Maximum sequence length to use with the `prompt`.

- `complex_human_instruction` (`List[str]`, *optional*): Instructions for the complex human instruction prompt-enhancement step (see https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55). Defaults to the built-in instruction shown in the signature above.

RETURNS DESCRIPTION

- `Union[SanaPipelineOutput, Tuple]`: If `return_dict` is `True`, [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] is returned; otherwise a `tuple` is returned whose first element is a list with the generated images.
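
Only tensors named in `callback_on_step_end_tensor_inputs` are exposed to the callback. A hedged sketch of a step-end callback that simply logs progress (it assumes the `pipe` object from the loading example above):

```python
def log_step(pipe, step, timestep, callback_kwargs):
    # Called once per denoising step with the tensors requested below.
    latents = callback_kwargs["latents"]
    print(f"step {step}: timestep {timestep}, latents shape {tuple(latents.shape)}")
    # Any returned entries overwrite the pipeline's tensors of the same name.
    return {"latents": latents}

image = pipe(
    prompt="a watercolor fox in a snowy forest",
    callback_on_step_end=log_step,
    callback_on_step_end_tensor_inputs=["latents"],
)[0][0]
```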

Source code in mindone/diffusers/pipelines/sana/pipeline_sana.py
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
def __call__(
    self,
    prompt: Union[str, List[str]] = None,
    negative_prompt: str = "",
    num_inference_steps: int = 20,
    timesteps: List[int] = None,
    sigmas: List[float] = None,
    guidance_scale: float = 4.5,
    num_images_per_prompt: Optional[int] = 1,
    height: int = 1024,
    width: int = 1024,
    eta: float = 0.0,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    prompt_embeds: Optional[ms.Tensor] = None,
    prompt_attention_mask: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_attention_mask: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    clean_caption: bool = True,
    use_resolution_binning: bool = True,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 300,
    complex_human_instruction: List[str] = [
        "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",  # noqa: E501
        "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",  # noqa: E501
        "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
        "Here are examples of how to transform or refine prompts:",
        "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",  # noqa: E501
        "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",  # noqa: E501
        "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
        "User Prompt: ",
    ],
) -> Union[SanaPipelineOutput, Tuple]:
    """
    Function invoked when calling the pipeline for generation.

    Args:
        prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
            instead.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
            less than `1`).
        num_inference_steps (`int`, *optional*, defaults to 20):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        timesteps (`List[int]`, *optional*):
            Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
            in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
            passed will be used. Must be in descending order.
        sigmas (`List[float]`, *optional*):
            Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
            their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
            will be used.
        guidance_scale (`float`, *optional*, defaults to 4.5):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
            1`. A higher guidance scale encourages the model to generate images closely linked to the text
            `prompt`, usually at the expense of lower image quality.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        height (`int`, *optional*, defaults to `1024`):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to `1024`):
            The width in pixels of the generated image.
        eta (`float`, *optional*, defaults to 0.0):
            Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
            [`schedulers.DDIMScheduler`], will be ignored for others.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [numpy generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will be generated by sampling using the supplied random `generator`.
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        prompt_attention_mask (`ms.Tensor`, *optional*): Pre-generated attention mask for text embeddings.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. For Sana, these should be the embeddings of the "" string. If
            not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.
        negative_prompt_attention_mask (`ms.Tensor`, *optional*):
            Pre-generated attention mask for negative text embeddings.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between
            [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] instead of a plain tuple.
        attention_kwargs (`Dict[str, Any]`, *optional*):
            A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
            `self.processor` in
            [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
        clean_caption (`bool`, *optional*, defaults to `True`):
            Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to
            be installed. If the dependencies are not installed, the embeddings will be created from the raw
            prompt.
        use_resolution_binning (`bool`, *optional*, defaults to `True`):
            If set to `True`, the requested height and width are first mapped to the closest resolutions in the
            aspect-ratio bin matching the model's sample size (`ASPECT_RATIO_512_BIN`, `ASPECT_RATIO_1024_BIN`, or
            `ASPECT_RATIO_2048_BIN`). After the produced latents are decoded into images, they are resized back to
            the requested resolution. Useful for generating non-square images.
        callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference. The function is called
            with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.
        max_sequence_length (`int` defaults to `300`):
            Maximum sequence length to use with the `prompt`.
        complex_human_instruction (`List[str]`, *optional*):
            The "complex human instruction" prepended to the prompt to improve prompt following; see
            https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.

    Examples:

    Returns:
        [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.sana.pipeline_output.SanaPipelineOutput`] is returned,
            otherwise a `tuple` is returned where the first element is a list with the generated images
    """

    if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
        callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

    # 1. Check inputs. Raise error if not correct
    if use_resolution_binning:
        if self.transformer.config.sample_size == 64:
            aspect_ratio_bin = ASPECT_RATIO_2048_BIN
        elif self.transformer.config.sample_size == 32:
            aspect_ratio_bin = ASPECT_RATIO_1024_BIN
        elif self.transformer.config.sample_size == 16:
            aspect_ratio_bin = ASPECT_RATIO_512_BIN
        else:
            raise ValueError("Invalid sample size")
        orig_height, orig_width = height, width
        height, width = self.image_processor.classify_height_width_bin(height, width, ratios=aspect_ratio_bin)

    self.check_inputs(
        prompt,
        height,
        width,
        callback_on_step_end_tensor_inputs,
        negative_prompt,
        prompt_embeds,
        negative_prompt_embeds,
        prompt_attention_mask,
        negative_prompt_attention_mask,
    )

    self._guidance_scale = guidance_scale
    self._attention_kwargs = attention_kwargs
    self._interrupt = False

    # 2. Determine the batch size
    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None

    # 3. Encode input prompt
    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = self.encode_prompt(
        prompt,
        self.do_classifier_free_guidance,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_images_per_prompt,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        clean_caption=clean_caption,
        max_sequence_length=max_sequence_length,
        complex_human_instruction=complex_human_instruction,
        lora_scale=lora_scale,
    )
    if self.do_classifier_free_guidance:
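        # negative embeddings are concatenated first so that `noise_pred.chunk(2)`
        # in the denoising loop yields (uncond, text) in that order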
        prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds], axis=0)
        prompt_attention_mask = ops.cat([negative_prompt_attention_mask, prompt_attention_mask], axis=0)

    # 4. Prepare timesteps
    timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps, sigmas)

    # 5. Prepare latents.
    latent_channels = self.transformer.config.in_channels
    latents = self.prepare_latents(
        batch_size * num_images_per_prompt,
        latent_channels,
        height,
        width,
        ms.float32,
        generator,
        latents,
    )

    # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
    extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

    # 7.0 lora_scale outside transformer
    if lora_scale is not None:
        # weight the lora layers by setting `lora_scale` for each PEFT layer
        scale_lora_layers(self.transformer, lora_scale)

    # 7. Denoising loop
    num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
    self._num_timesteps = len(timesteps)

    with self.progress_bar(total=num_inference_steps) as progress_bar:
        for i, t in enumerate(timesteps):
            if self.interrupt:
                continue

            latent_model_input = ops.cat([latents] * 2) if self.do_classifier_free_guidance else latents
            latent_model_input = latent_model_input.to(prompt_embeds.dtype)

            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
            timestep = t.broadcast_to((latent_model_input.shape[0],)).to(latents.dtype)

            # predict noise model_output
            noise_pred = self.transformer(
                latent_model_input,
                encoder_hidden_states=prompt_embeds,
                encoder_attention_mask=prompt_attention_mask,
                timestep=timestep,
                return_dict=False,
                attention_kwargs=self.attention_kwargs,
            )[0]
            noise_pred = noise_pred.float()

            # perform guidance
            if self.do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            # learned sigma: when the transformer also predicts variance
            # (out_channels == 2 * latent_channels), keep only the noise half
            if self.transformer.config.out_channels // 2 == latent_channels:
                noise_pred = noise_pred.chunk(2, axis=1)[0]

            # compute previous image: x_t -> x_t-1
            latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

            # update the progress bar
            if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                progress_bar.update()

    if lora_scale is not None:
        # remove `lora_scale` from each PEFT layer
        unscale_lora_layers(self.transformer, lora_scale)

    if output_type == "latent":
        image = latents
    else:
        latents = latents.to(self.vae.dtype)
        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
        if use_resolution_binning:
            image = self.image_processor.resize_and_crop_tensor(image, orig_width, orig_height)

    if output_type != "latent":
        image = self.image_processor.postprocess(image, output_type=output_type)

    if not return_dict:
        return (image,)

    return SanaPipelineOutput(images=image)
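
A minimal text-to-image sketch using this pipeline; a hedged example assuming the BF16 checkpoint and `variant` recommended earlier on this page, with the prompt and output path as placeholders:

import mindspore as ms

from mindone.diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",
    mindspore_dtype=ms.bfloat16,
)
# keep the text encoder and VAE in bfloat16, per the dtype note above
pipe.text_encoder.to(ms.bfloat16)
pipe.vae.to(ms.bfloat16)

# `return_dict` defaults to False, so the call returns a tuple whose first
# element is the list of generated PIL images
image = pipe(
    prompt='a cyberpunk cat with a neon sign that says "Sana"',
    num_inference_steps=20,
    guidance_scale=4.5,
)[0][0]
image.save("sana.png")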

mindone.diffusers.SanaPipeline.encode_prompt(prompt, do_classifier_free_guidance=True, negative_prompt='', num_images_per_prompt=1, prompt_embeds=None, negative_prompt_embeds=None, prompt_attention_mask=None, negative_prompt_attention_mask=None, clean_caption=False, max_sequence_length=300, complex_human_instruction=None, lora_scale=None)

Encodes the prompt into text encoder hidden states.

PARAMETER DESCRIPTION
prompt

prompt to be encoded

TYPE: `str` or `List[str]`, *optional*

negative_prompt

The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For Sana, this should be "".

TYPE: `str` or `List[str]`, *optional* DEFAULT: ''

do_classifier_free_guidance

whether to use classifier free guidance or not

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

num_images_per_prompt

number of images that should be generated per prompt

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

prompt_embeds

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. For Sana, these should be the embeddings of the "" string.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

clean_caption

If True, the function will preprocess and clean the provided caption before encoding.

TYPE: `bool`, defaults to `False` DEFAULT: False

max_sequence_length

Maximum sequence length to use for the prompt.

TYPE: `int`, defaults to 300 DEFAULT: 300

complex_human_instruction

If complex_human_instruction is not empty, the function will prepend the complex human instruction to the prompt.

TYPE: `list[str]`, defaults to `complex_human_instruction` DEFAULT: None

Source code in mindone/diffusers/pipelines/sana/pipeline_sana.py
def encode_prompt(
    self,
    prompt: Union[str, List[str]],
    do_classifier_free_guidance: bool = True,
    negative_prompt: str = "",
    num_images_per_prompt: int = 1,
    prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    prompt_attention_mask: Optional[ms.Tensor] = None,
    negative_prompt_attention_mask: Optional[ms.Tensor] = None,
    clean_caption: bool = False,
    max_sequence_length: int = 300,
    complex_human_instruction: Optional[List[str]] = None,
    lora_scale: Optional[float] = None,
):
    r"""
    Encodes the prompt into text encoder hidden states.

    Args:
        prompt (`str` or `List[str]`, *optional*):
            prompt to be encoded
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds`
            instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For
            Sana, this should be "".
        do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
            whether to use classifier free guidance or not
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            number of images that should be generated per prompt
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. For Sana, these should be the embeddings of the "" string.
        clean_caption (`bool`, defaults to `False`):
            If `True`, the function will preprocess and clean the provided caption before encoding.
        max_sequence_length (`int`, defaults to 300): Maximum sequence length to use for the prompt.
        complex_human_instruction (`list[str]`, *optional*):
            If `complex_human_instruction` is not empty, the function will prepend the complex human instruction to
            the prompt.
    """
    # set lora scale so that monkey patched LoRA
    # function of text encoder can correctly access it
    if lora_scale is not None and isinstance(self, SanaLoraLoaderMixin):
        self._lora_scale = lora_scale

        # dynamically adjust the LoRA scale
        if self.text_encoder is not None:
            scale_lora_layers(self.text_encoder, lora_scale)

    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    self.tokenizer.padding_side = "right"

    # See Section 3.1. of the paper.
    max_length = max_sequence_length
    select_index = [0] + list(range(-max_length + 1, 0))
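    # keep position 0 (BOS) plus the last (max_length - 1) positions of the padded
    # sequence; the attention mask is sliced with the same indices below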

    if prompt_embeds is None:
        prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)

        # prepare complex human instruction
        if not complex_human_instruction:
            max_length_all = max_length
        else:
            chi_prompt = "\n".join(complex_human_instruction)
            prompt = [chi_prompt + p for p in prompt]
            num_chi_prompt_tokens = len(self.tokenizer.encode(chi_prompt))
            max_length_all = num_chi_prompt_tokens + max_length - 2

        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=max_length_all,
            truncation=True,
            add_special_tokens=True,
            return_tensors="np",
        )
        text_input_ids = ms.Tensor.from_numpy(text_inputs.input_ids)

        prompt_attention_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)

        prompt_embeds = self.text_encoder(text_input_ids, attention_mask=prompt_attention_mask)
        prompt_embeds = prompt_embeds[0][:, select_index]
        prompt_attention_mask = prompt_attention_mask[:, select_index]

    if self.transformer is not None:
        dtype = self.transformer.dtype
    elif self.text_encoder is not None:
        dtype = self.text_encoder.dtype
    else:
        dtype = None

    prompt_embeds = prompt_embeds.to(dtype=dtype)

    bs_embed, seq_len, _ = prompt_embeds.shape
    # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
    prompt_embeds = prompt_embeds.tile((1, num_images_per_prompt, 1))
    prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
    prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1)
    prompt_attention_mask = prompt_attention_mask.tile((num_images_per_prompt, 1))

    # get unconditional embeddings for classifier free guidance
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        uncond_tokens = [negative_prompt] * batch_size if isinstance(negative_prompt, str) else negative_prompt
        uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
        max_length = prompt_embeds.shape[1]
        uncond_input = self.tokenizer(
            uncond_tokens,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="np",
        )
        negative_prompt_attention_mask = ms.Tensor.from_numpy(uncond_input.attention_mask)

        negative_prompt_embeds = self.text_encoder(
            ms.Tensor.from_numpy(uncond_input.input_ids),
            attention_mask=negative_prompt_attention_mask,
        )
        negative_prompt_embeds = negative_prompt_embeds[0]

    if do_classifier_free_guidance:
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype)

        negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

        negative_prompt_attention_mask = negative_prompt_attention_mask.view(bs_embed, -1)
        negative_prompt_attention_mask = negative_prompt_attention_mask.tile((num_images_per_prompt, 1))
    else:
        negative_prompt_embeds = None
        negative_prompt_attention_mask = None

    if self.text_encoder is not None:
        if isinstance(self, SanaLoraLoaderMixin):
            # Retrieve the original scale by scaling back the LoRA layers
            unscale_lora_layers(self.text_encoder, lora_scale)

    return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask
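
A hedged sketch of precomputing embeddings once and reusing them across calls. `pipe` is assumed to be a loaded `SanaPipeline` and the prompt is a placeholder; note that calling `encode_prompt` directly skips the default complex human instruction unless you pass one, and that `__call__` requires the attention masks whenever embeddings are supplied directly:

(
    prompt_embeds,
    prompt_attention_mask,
    negative_prompt_embeds,
    negative_prompt_attention_mask,
) = pipe.encode_prompt(
    "a watercolor lighthouse at dawn",
    do_classifier_free_guidance=True,
    negative_prompt="",
)

# reuse the cached embeddings; `prompt` must stay None in this mode
image = pipe(
    prompt_embeds=prompt_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
)[0][0]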

mindone.diffusers.SanaPAGPipeline

Bases: DiffusionPipeline, PAGMixin

Pipeline for text-to-image generation using Sana. This pipeline supports the use of Perturbed Attention Guidance (PAG).
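
A hedged loading sketch for the PAG variant. The checkpoint mirrors the non-PAG example above, and the applied-layer choice is illustrative (the `__init__` default below is "transformer_blocks.0"); `pag_applied_layers` is forwarded to `__init__` by `from_pretrained`:

import mindspore as ms

from mindone.diffusers import SanaPAGPipeline

pipe = SanaPAGPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",
    mindspore_dtype=ms.bfloat16,
    pag_applied_layers="transformer_blocks.8",
)
image = pipe(
    prompt="an astronaut riding a horse on mars",
    pag_scale=2.0,
)[0][0]
image.save("sana_pag.png")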

Source code in mindone/diffusers/pipelines/pag/pipeline_pag_sana.py
class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
    r"""
    Pipeline for text-to-image generation using [Sana](https://huggingface.co/papers/2410.10629). This pipeline
    supports the use of [Perturbed Attention Guidance
    (PAG)](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag).
    """

    # fmt: off
    bad_punct_regex = re.compile(r"[" + "#®•©™&@·º½¾¿¡§~" + r"\)" + r"\(" + r"\]" + r"\[" + r"\}" + r"\{" + r"\|" + "\\" + r"\/" + r"\*" + r"]{1,}")
    # fmt: on

    model_cpu_offload_seq = "text_encoder->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]

    def __init__(
        self,
        tokenizer: AutoTokenizer,
        text_encoder: MSPreTrainedModel,
        vae: AutoencoderDC,
        transformer: SanaTransformer2DModel,
        scheduler: FlowMatchEulerDiscreteScheduler,
        pag_applied_layers: Union[str, List[str]] = "transformer_blocks.0",
    ):
        super().__init__()

        self.register_modules(
            tokenizer=tokenizer, text_encoder=text_encoder, vae=vae, transformer=transformer, scheduler=scheduler
        )

        self.vae_scale_factor = 2 ** (len(self.vae.config.encoder_block_out_channels) - 1)
        self.image_processor = PixArtImageProcessor(vae_scale_factor=self.vae_scale_factor)

        self.set_pag_applied_layers(
            pag_applied_layers,
            pag_attn_processors=(PAGCFGSanaLinearAttnProcessor2_0(), PAGIdentitySanaLinearAttnProcessor2_0()),
        )

    def encode_prompt(
        self,
        prompt: Union[str, List[str]],
        do_classifier_free_guidance: bool = True,
        negative_prompt: str = "",
        num_images_per_prompt: int = 1,
        prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        prompt_attention_mask: Optional[ms.Tensor] = None,
        negative_prompt_attention_mask: Optional[ms.Tensor] = None,
        clean_caption: bool = False,
        max_sequence_length: int = 300,
        complex_human_instruction: Optional[List[str]] = None,
    ):
        r"""
        Encodes the prompt into text encoder hidden states.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                prompt to be encoded
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds`
                instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For
                Sana, this should be "".
            do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
                whether to use classifier free guidance or not
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                number of images that should be generated per prompt
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. For Sana, these should be the embeddings of the "" string.
            clean_caption (`bool`, defaults to `False`):
                If `True`, the function will preprocess and clean the provided caption before encoding.
            max_sequence_length (`int`, defaults to 300): Maximum sequence length to use for the prompt.
            complex_human_instruction (`list[str]`, *optional*):
                If `complex_human_instruction` is not empty, the function will prepend the complex human instruction
                to the prompt.
        """
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        self.tokenizer.padding_side = "right"

        # See Section 3.1. of the paper.
        max_length = max_sequence_length
        select_index = [0] + list(range(-max_length + 1, 0))
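        # keep position 0 (BOS) plus the last (max_length - 1) positions of the padded
        # sequence; the attention mask is sliced with the same indices below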

        if prompt_embeds is None:
            prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)

            # prepare complex human instruction
            if not complex_human_instruction:
                max_length_all = max_length
            else:
                chi_prompt = "\n".join(complex_human_instruction)
                prompt = [chi_prompt + p for p in prompt]
                num_chi_prompt_tokens = len(self.tokenizer.encode(chi_prompt))
                max_length_all = num_chi_prompt_tokens + max_length - 2

            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=max_length_all,
                truncation=True,
                add_special_tokens=True,
                return_tensors="np",
            )
            text_input_ids = ms.Tensor.from_numpy(text_inputs.input_ids)

            prompt_attention_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)

            prompt_embeds = self.text_encoder(text_input_ids, attention_mask=prompt_attention_mask)
            prompt_embeds = prompt_embeds[0][:, select_index]
            prompt_attention_mask = prompt_attention_mask[:, select_index]

        if self.transformer is not None:
            dtype = self.transformer.dtype
        elif self.text_encoder is not None:
            dtype = self.text_encoder.dtype
        else:
            dtype = None

        prompt_embeds = prompt_embeds.to(dtype=dtype)

        bs_embed, seq_len, _ = prompt_embeds.shape
        # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
        prompt_embeds = prompt_embeds.tile((1, num_images_per_prompt, 1))
        prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
        prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1)
        prompt_attention_mask = prompt_attention_mask.tile((num_images_per_prompt, 1))

        # get unconditional embeddings for classifier free guidance
        if do_classifier_free_guidance and negative_prompt_embeds is None:
            uncond_tokens = [negative_prompt] * batch_size if isinstance(negative_prompt, str) else negative_prompt
            uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
            max_length = prompt_embeds.shape[1]
            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_attention_mask=True,
                add_special_tokens=True,
                return_tensors="np",
            )
            negative_prompt_attention_mask = ms.Tensor.from_numpy(uncond_input.attention_mask)

            negative_prompt_embeds = self.text_encoder(
                ms.Tensor.from_numpy(uncond_input.input_ids), attention_mask=negative_prompt_attention_mask
            )
            negative_prompt_embeds = negative_prompt_embeds[0]

        if do_classifier_free_guidance:
            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
            seq_len = negative_prompt_embeds.shape[1]

            negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype)

            negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

            negative_prompt_attention_mask = negative_prompt_attention_mask.view(bs_embed, -1)
            negative_prompt_attention_mask = negative_prompt_attention_mask.tile((num_images_per_prompt, 1))
        else:
            negative_prompt_embeds = None
            negative_prompt_attention_mask = None

        return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
    def prepare_extra_step_kwargs(self, generator, eta):
        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]

        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check if the scheduler accepts generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

    # Copied from diffusers.pipelines.sana.pipeline_sana.SanaPipeline.check_inputs
    def check_inputs(
        self,
        prompt,
        height,
        width,
        callback_on_step_end_tensor_inputs=None,
        negative_prompt=None,
        prompt_embeds=None,
        negative_prompt_embeds=None,
        prompt_attention_mask=None,
        negative_prompt_attention_mask=None,
    ):
        if height % 32 != 0 or width % 32 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 32 but are {height} and {width}.")

        if callback_on_step_end_tensor_inputs is not None and not all(
            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
        ):
            raise ValueError(
                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"  # noqa: E501
            )

        if prompt is not None and prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                " only forward one of the two."
            )
        elif prompt is None and prompt_embeds is None:
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if prompt_embeds is not None and prompt_attention_mask is None:
            raise ValueError("Must provide `prompt_attention_mask` when specifying `prompt_embeds`.")

        if negative_prompt_embeds is not None and negative_prompt_attention_mask is None:
            raise ValueError("Must provide `negative_prompt_attention_mask` when specifying `negative_prompt_embeds`.")

        if prompt_embeds is not None and negative_prompt_embeds is not None:
            if prompt_embeds.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )
            if prompt_attention_mask.shape != negative_prompt_attention_mask.shape:
                raise ValueError(
                    "`prompt_attention_mask` and `negative_prompt_attention_mask` must have the same shape when passed directly, but"
                    f" got: `prompt_attention_mask` {prompt_attention_mask.shape} != `negative_prompt_attention_mask`"
                    f" {negative_prompt_attention_mask.shape}."
                )

    # Copied from diffusers.pipelines.sana.pipeline_sana.SanaPipeline._text_preprocessing
    def _text_preprocessing(self, text, clean_caption=False):
        if clean_caption and not is_bs4_available():
            logger.warning(BACKENDS_MAPPING["bs4"][-1].format("Setting `clean_caption=True`"))
            logger.warning("Setting `clean_caption` to False...")
            clean_caption = False

        if clean_caption and not is_ftfy_available():
            logger.warning(BACKENDS_MAPPING["ftfy"][-1].format("Setting `clean_caption=True`"))
            logger.warning("Setting `clean_caption` to False...")
            clean_caption = False

        if not isinstance(text, (tuple, list)):
            text = [text]

        def process(text: str):
            if clean_caption:
                text = self._clean_caption(text)
                text = self._clean_caption(text)
            else:
                text = text.lower().strip()
            return text

        return [process(t) for t in text]

    # Copied from diffusers.pipelines.sana.pipeline_sana.SanaPipeline._clean_caption
    def _clean_caption(self, caption):
        caption = str(caption)
        caption = ul.unquote_plus(caption)
        caption = caption.strip().lower()
        caption = re.sub("<person>", "person", caption)
        # urls:
        caption = re.sub(
            r"\b((?:https?:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
            "",
            caption,
        )  # regex for urls
        caption = re.sub(
            r"\b((?:www:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
            "",
            caption,
        )  # regex for urls
        # html:
        caption = BeautifulSoup(caption, features="html.parser").text

        # @<nickname>
        caption = re.sub(r"@[\w\d]+\b", "", caption)

        # 31C0—31EF CJK Strokes
        # 31F0—31FF Katakana Phonetic Extensions
        # 3200—32FF Enclosed CJK Letters and Months
        # 3300—33FF CJK Compatibility
        # 3400—4DBF CJK Unified Ideographs Extension A
        # 4DC0—4DFF Yijing Hexagram Symbols
        # 4E00—9FFF CJK Unified Ideographs
        caption = re.sub(r"[\u31c0-\u31ef]+", "", caption)
        caption = re.sub(r"[\u31f0-\u31ff]+", "", caption)
        caption = re.sub(r"[\u3200-\u32ff]+", "", caption)
        caption = re.sub(r"[\u3300-\u33ff]+", "", caption)
        caption = re.sub(r"[\u3400-\u4dbf]+", "", caption)
        caption = re.sub(r"[\u4dc0-\u4dff]+", "", caption)
        caption = re.sub(r"[\u4e00-\u9fff]+", "", caption)
        #######################################################

        # all types of dash --> "-"
        caption = re.sub(
            r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]+",  # noqa
            "-",
            caption,
        )

        # standardize quotation marks
        caption = re.sub(r"[`´«»“”¨]", '"', caption)
        caption = re.sub(r"[‘’]", "'", caption)

        # &quot;
        caption = re.sub(r"&quot;?", "", caption)
        # &amp
        caption = re.sub(r"&amp", "", caption)

        # ip adresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
        caption = re.sub(r"\d:\d\d\s+$", "", caption)

        # \n
        caption = re.sub(r"\\n", " ", caption)

        # "#123"
        caption = re.sub(r"#\d{1,3}\b", "", caption)
        # "#12345.."
        caption = re.sub(r"#\d{5,}\b", "", caption)
        # "123456.."
        caption = re.sub(r"\b\d{6,}\b", "", caption)
        # filenames:
        caption = re.sub(r"[\S]+\.(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)", "", caption)

        #
        caption = re.sub(r"[\"\']{2,}", r'"', caption)  # """AUSVERKAUFT"""
        caption = re.sub(r"[\.]{2,}", r" ", caption)  # """AUSVERKAUFT"""

        caption = re.sub(self.bad_punct_regex, r" ", caption)  # ***AUSVERKAUFT***, #AUSVERKAUFT
        caption = re.sub(r"\s+\.\s+", r" ", caption)  # " . "

        # this-is-my-cute-cat / this_is_my_cute_cat
        regex2 = re.compile(r"(?:\-|\_)")
        if len(re.findall(regex2, caption)) > 3:
            caption = re.sub(regex2, " ", caption)

        caption = ftfy.fix_text(caption)
        caption = html.unescape(html.unescape(caption))

        caption = re.sub(r"\b[a-zA-Z]{1,3}\d{3,15}\b", "", caption)  # jc6640
        caption = re.sub(r"\b[a-zA-Z]+\d+[a-zA-Z]+\b", "", caption)  # jc6640vc
        caption = re.sub(r"\b\d+[a-zA-Z]+\d+\b", "", caption)  # 6640vc231

        caption = re.sub(r"(worldwide\s+)?(free\s+)?shipping", "", caption)
        caption = re.sub(r"(free\s)?download(\sfree)?", "", caption)
        caption = re.sub(r"\bclick\b\s(?:for|on)\s\w+", "", caption)
        caption = re.sub(r"\b(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)(\simage[s]?)?", "", caption)
        caption = re.sub(r"\bpage\s+\d+\b", "", caption)

        caption = re.sub(r"\b\d*[a-zA-Z]+\d+[a-zA-Z]+\d+[a-zA-Z\d]*\b", r" ", caption)  # j2d1a2a...

        caption = re.sub(r"\b\d+\.?\d*[xх×]\d+\.?\d*\b", "", caption)

        caption = re.sub(r"\b\s+\:\s+", r": ", caption)
        caption = re.sub(r"(\D[,\./])\b", r"\1 ", caption)
        caption = re.sub(r"\s+", " ", caption)

        caption = caption.strip()

        caption = re.sub(r"^[\"\']([\w\W]+)[\"\']$", r"\1", caption)
        caption = re.sub(r"^[\'\_,\-\:;]", r"", caption)
        caption = re.sub(r"[\'\_,\-\:\-\+]$", r"", caption)
        caption = re.sub(r"^\.\S+$", "", caption)

        return caption.strip()

    # Copied from diffusers.pipelines.sana.pipeline_sana.SanaPipeline.prepare_latents
    def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, generator, latents=None):
        if latents is not None:
            return latents.to(dtype=dtype)

        shape = (
            batch_size,
            num_channels_latents,
            int(height) // self.vae_scale_factor,
            int(width) // self.vae_scale_factor,
        )
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        latents = randn_tensor(shape, generator=generator, dtype=dtype)
        return latents

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1.0

    @property
    def num_timesteps(self):
        return self._num_timesteps

    @property
    def interrupt(self):
        return self._interrupt

    def __call__(
        self,
        prompt: Union[str, List[str]] = None,
        negative_prompt: str = "",
        num_inference_steps: int = 20,
        timesteps: List[int] = None,
        sigmas: List[float] = None,
        guidance_scale: float = 4.5,
        num_images_per_prompt: Optional[int] = 1,
        height: int = 1024,
        width: int = 1024,
        eta: float = 0.0,
        generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
        latents: Optional[ms.Tensor] = None,
        prompt_embeds: Optional[ms.Tensor] = None,
        prompt_attention_mask: Optional[ms.Tensor] = None,
        negative_prompt_embeds: Optional[ms.Tensor] = None,
        negative_prompt_attention_mask: Optional[ms.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = False,
        clean_caption: bool = True,
        use_resolution_binning: bool = True,
        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        max_sequence_length: int = 300,
        complex_human_instruction: List[str] = [
            "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",  # noqa: E501
            "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",  # noqa: E501
            "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
            "Here are examples of how to transform or refine prompts:",
            "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",  # noqa: E501
            "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",  # noqa: E501
            "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
            "User Prompt: ",
        ],
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ) -> Union[ImagePipelineOutput, Tuple]:
        """
        Function invoked when calling the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
                instead.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                less than `1`).
            num_inference_steps (`int`, *optional*, defaults to 20):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[int]`, *optional*):
                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
                passed will be used. Must be in descending order.
            sigmas (`List[float]`, *optional*):
                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 4.5):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                1`. A higher guidance scale encourages the model to generate images closely linked to the text
                `prompt`, usually at the expense of lower image quality.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            height (`int`, *optional*, defaults to `1024`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `1024`):
                The width in pixels of the generated image.
            eta (`float`, *optional*, defaults to 0.0):
                Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
                [`schedulers.DDIMScheduler`], will be ignored for others.
            generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
                One or a list of [numpy generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
                to make generation deterministic.
            latents (`ms.Tensor`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by sampling using the supplied random `generator`.
            prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            prompt_attention_mask (`ms.Tensor`, *optional*): Pre-generated attention mask for text embeddings.
            negative_prompt_embeds (`ms.Tensor`, *optional*):
                Pre-generated negative text embeddings. For Sana, these should be the embeddings of the "" string.
                If not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.
            negative_prompt_attention_mask (`ms.Tensor`, *optional*):
                Pre-generated attention mask for negative text embeddings.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between
                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `False`):
                Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            clean_caption (`bool`, *optional*, defaults to `True`):
                Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to
                be installed. If the dependencies are not installed, the embeddings will be created from the raw
                prompt.
            use_resolution_binning (`bool`, *optional*, defaults to `True`):
                If set to `True`, the requested height and width are first mapped to the closest resolutions in the
                aspect-ratio bin matching the model's sample size (`ASPECT_RATIO_512_BIN`, `ASPECT_RATIO_1024_BIN`,
                or `ASPECT_RATIO_2048_BIN`). After the produced latents are decoded into images, they are resized
                back to the requested resolution. Useful for generating non-square images.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.
            max_sequence_length (`int` defaults to 300): Maximum sequence length to use with the `prompt`.
            complex_human_instruction (`List[str]`, *optional*):
                The "complex human instruction" prepended to the prompt to improve prompt following; see
                https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.
            pag_scale (`float`, *optional*, defaults to 3.0):
                The scale factor for the perturbed attention guidance. If it is set to 0.0, the perturbed attention
                guidance will not be used.
            pag_adaptive_scale (`float`, *optional*, defaults to 0.0):
                The adaptive scale factor for the perturbed attention guidance. If it is set to 0.0, `pag_scale` is
                used.

        Examples:

        Returns:
            [`~pipelines.ImagePipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
                returned where the first element is a list with the generated images
        """

        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

        if use_resolution_binning:
            if self.transformer.config.sample_size == 64:
                aspect_ratio_bin = ASPECT_RATIO_2048_BIN
            elif self.transformer.config.sample_size == 32:
                aspect_ratio_bin = ASPECT_RATIO_1024_BIN
            elif self.transformer.config.sample_size == 16:
                aspect_ratio_bin = ASPECT_RATIO_512_BIN
            else:
                raise ValueError("Invalid sample size")
            orig_height, orig_width = height, width
            height, width = self.image_processor.classify_height_width_bin(height, width, ratios=aspect_ratio_bin)

        self.check_inputs(
            prompt,
            height,
            width,
            callback_on_step_end_tensor_inputs,
            negative_prompt,
            prompt_embeds,
            negative_prompt_embeds,
            prompt_attention_mask,
            negative_prompt_attention_mask,
        )

        self._pag_scale = pag_scale
        self._pag_adaptive_scale = pag_adaptive_scale
        self._guidance_scale = guidance_scale
        self._interrupt = False

        # 2. Determine the batch size
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        # 3. Encode input prompt
        (
            prompt_embeds,
            prompt_attention_mask,
            negative_prompt_embeds,
            negative_prompt_attention_mask,
        ) = self.encode_prompt(
            prompt,
            self.do_classifier_free_guidance,
            negative_prompt=negative_prompt,
            num_images_per_prompt=num_images_per_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            prompt_attention_mask=prompt_attention_mask,
            negative_prompt_attention_mask=negative_prompt_attention_mask,
            clean_caption=clean_caption,
            max_sequence_length=max_sequence_length,
            complex_human_instruction=complex_human_instruction,
        )

        if self.do_perturbed_attention_guidance:
            prompt_embeds = self._prepare_perturbed_attention_guidance(
                prompt_embeds, negative_prompt_embeds, self.do_classifier_free_guidance
            )
            prompt_attention_mask = self._prepare_perturbed_attention_guidance(
                prompt_attention_mask, negative_prompt_attention_mask, self.do_classifier_free_guidance
            )
        elif self.do_classifier_free_guidance:
            prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds], axis=0)
            prompt_attention_mask = ops.cat([negative_prompt_attention_mask, prompt_attention_mask], axis=0)

        # 4. Prepare timesteps
        timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps, sigmas)

        # 5. Prepare latents.
        latent_channels = self.transformer.config.in_channels
        latents = self.prepare_latents(
            batch_size * num_images_per_prompt,
            latent_channels,
            height,
            width,
            ms.float32,
            generator,
            latents,
        )
        if self.do_perturbed_attention_guidance:
            original_attn_proc = self.transformer.attn_processors
            self._set_pag_attn_processor(
                pag_applied_layers=self.pag_applied_layers,
                do_classifier_free_guidance=self.do_classifier_free_guidance,
            )

        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

        # 7. Denoising loop
        num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
        self._num_timesteps = len(timesteps)

        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                if self.interrupt:
                    continue

                # expand the latents if we are doing classifier free guidance, perturbed-attention guidance, or both
                latent_model_input = ops.cat([latents] * (prompt_embeds.shape[0] // latents.shape[0]))
                latent_model_input = latent_model_input.to(prompt_embeds.dtype)

                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
                timestep = t.broadcast_to((latent_model_input.shape[0],)).to(latents.dtype)

                # predict noise model_output
                noise_pred = self.transformer(
                    latent_model_input,
                    encoder_hidden_states=prompt_embeds,
                    encoder_attention_mask=prompt_attention_mask,
                    timestep=timestep,
                    return_dict=False,
                )[0]
                noise_pred = noise_pred.float()

                # perform guidance
                if self.do_perturbed_attention_guidance:
                    noise_pred = self._apply_perturbed_attention_guidance(
                        noise_pred, self.do_classifier_free_guidance, guidance_scale, t
                    )
                elif self.do_classifier_free_guidance:
                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

                # compute previous image: x_t -> x_t-1
                latents_dtype = latents.dtype
                latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

                # call the callback, if provided
                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        if output_type == "latent":
            image = latents
        else:
            latents = latents.to(self.vae.dtype)
            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
            if use_resolution_binning:
                image = self.image_processor.resize_and_crop_tensor(image, orig_width, orig_height)

        if output_type != "latent":
            image = self.image_processor.postprocess(image, output_type=output_type)

        if self.do_perturbed_attention_guidance:
            self.transformer.set_attn_processor(original_attn_proc)

        if not return_dict:
            return (image,)

        return ImagePipelineOutput(images=image)
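To see how the batch expansion in the denoising loop above plays out, here is a small, self-contained illustration (the shapes are assumptions for demonstration, not Sana's exact dimensions): with classifier-free guidance alone the prompt embeddings hold [uncond, cond] and the latents are repeated twice per step; with perturbed-attention guidance on top they hold three chunks and the latents are repeated three times.

from mindspore import ops

# Illustration of the `prompt_embeds.shape[0] // latents.shape[0]` expansion
# used in the denoising loop (shapes are assumptions, not Sana's real ones).
latents = ops.randn(2, 32, 32, 32)       # batch of 2 latent images
prompt_embeds = ops.randn(6, 300, 2304)  # 3x: CFG + PAG chunks per image

repeat = prompt_embeds.shape[0] // latents.shape[0]  # -> 3
latent_model_input = ops.cat([latents] * repeat)
print(latent_model_input.shape)  # (6, 32, 32, 32)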

mindone.diffusers.SanaPAGPipeline.__call__(prompt=None, negative_prompt='', num_inference_steps=20, timesteps=None, sigmas=None, guidance_scale=4.5, num_images_per_prompt=1, height=1024, width=1024, eta=0.0, generator=None, latents=None, prompt_embeds=None, prompt_attention_mask=None, negative_prompt_embeds=None, negative_prompt_attention_mask=None, output_type='pil', return_dict=False, clean_caption=True, use_resolution_binning=True, callback_on_step_end=None, callback_on_step_end_tensor_inputs=['latents'], max_sequence_length=300, complex_human_instruction=["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: '], pag_scale=3.0, pag_adaptive_scale=0.0)

Function invoked when calling the pipeline for generation.

PARAMETER DESCRIPTION
prompt

The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.

TYPE: `str` or `List[str]`, *optional* DEFAULT: None

negative_prompt

The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

TYPE: `str` or `List[str]`, *optional* DEFAULT: ''

num_inference_steps

The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

TYPE: `int`, *optional*, defaults to 20 DEFAULT: 20

timesteps

Custom timesteps to use for the denoising process with schedulers which support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.

TYPE: `List[int]`, *optional* DEFAULT: None

sigmas

Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.

TYPE: `List[float]`, *optional* DEFAULT: None

guidance_scale

Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

TYPE: `float`, *optional*, defaults to 4.5 DEFAULT: 4.5

num_images_per_prompt

The number of images to generate per prompt.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

height

The height in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

width

The width in pixels of the generated image.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

eta

Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to [schedulers.DDIMScheduler], will be ignored for others.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

generator

One or a list of numpy generator(s) to make generation deterministic.

TYPE: `np.random.Generator` or `List[np.random.Generator]`, *optional* DEFAULT: None

latents

Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

prompt_embeds

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

prompt_attention_mask

Pre-generated attention mask for text embeddings.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. For Sana this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_attention_mask

Pre-generated attention mask for negative text embeddings.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

output_type

The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.

TYPE: `str`, *optional*, defaults to `"pil"` DEFAULT: 'pil'

return_dict

Whether or not to return a [~pipelines.ImagePipelineOutput] instead of a plain tuple.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

clean_caption

Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

use_resolution_binning

If set to True, the requested height and width are first mapped to the closest resolutions using ASPECT_RATIO_1024_BIN. After the produced latents are decoded into images, they are resized back to the requested resolution. Useful for generating non-square images.

TYPE: `bool` defaults to `True` DEFAULT: True

callback_on_step_end

A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.

TYPE: `Callable`, *optional* DEFAULT: None

callback_on_step_end_tensor_inputs

The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

TYPE: `List`, *optional* DEFAULT: ['latents']

max_sequence_length

Maximum sequence length to use with the prompt.

TYPE: `int` defaults to 300 DEFAULT: 300

complex_human_instruction

The complex human instruction used to enhance the prompt; see https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.

TYPE: `List[str]`, *optional* DEFAULT: ["Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:", '- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.', '- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.', 'Here are examples of how to transform or refine prompts:', '- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.', '- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.', 'Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:', 'User Prompt: ']

pag_scale

The scale factor for the perturbed attention guidance. If it is set to 0.0, the perturbed attention guidance will not be used.

TYPE: `float`, *optional*, defaults to 3.0 DEFAULT: 3.0

pag_adaptive_scale

The adaptive scale factor for the perturbed attention guidance. If it is set to 0.0, pag_scale is used.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

RETURNS DESCRIPTION
Union[ImagePipelineOutput, Tuple]

[~pipelines.ImagePipelineOutput] or tuple: If return_dict is True, [~pipelines.ImagePipelineOutput] is returned, otherwise a tuple is returned where the first element is a list with the generated images
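A minimal end-to-end sketch of calling this pipeline (hypothetical usage; the checkpoint id comes from the model table above, and the pipeline's default PAG-applied layers are assumed):

import mindspore as ms
from mindone.diffusers import SanaPAGPipeline

pipe = SanaPAGPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",
    mindspore_dtype=ms.bfloat16,
)

# return_dict defaults to False, so the result is a tuple whose first
# element is the list of generated PIL images.
image = pipe(
    prompt="a cyberpunk cat with a neon sign that reads 'Sana'",
    guidance_scale=4.5,
    pag_scale=3.0,
    num_inference_steps=20,
)[0][0]
image.save("sana_pag.png")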

Source code in mindone/diffusers/pipelines/pag/pipeline_pag_sana.py
def __call__(
    self,
    prompt: Union[str, List[str]] = None,
    negative_prompt: str = "",
    num_inference_steps: int = 20,
    timesteps: List[int] = None,
    sigmas: List[float] = None,
    guidance_scale: float = 4.5,
    num_images_per_prompt: Optional[int] = 1,
    height: int = 1024,
    width: int = 1024,
    eta: float = 0.0,
    generator: Optional[Union[np.random.Generator, List[np.random.Generator]]] = None,
    latents: Optional[ms.Tensor] = None,
    prompt_embeds: Optional[ms.Tensor] = None,
    prompt_attention_mask: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_attention_mask: Optional[ms.Tensor] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = False,
    clean_caption: bool = True,
    use_resolution_binning: bool = True,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 300,
    complex_human_instruction: List[str] = [
        "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",  # noqa: E501
        "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",  # noqa: E501
        "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
        "Here are examples of how to transform or refine prompts:",
        "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",  # noqa: E501
        "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",  # noqa: E501
        "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
        "User Prompt: ",
    ],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
) -> Union[ImagePipelineOutput, Tuple]:
    """
    Function invoked when calling the pipeline for generation.

    Args:
        prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
            instead.
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt or prompts not to guide the image generation. If not defined, one has to pass
            `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
            less than `1`).
        num_inference_steps (`int`, *optional*, defaults to 20):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        timesteps (`List[int]`, *optional*):
            Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
            in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
            passed will be used. Must be in descending order.
        sigmas (`List[float]`, *optional*):
            Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
            their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
            will be used.
        guidance_scale (`float`, *optional*, defaults to 4.5):
            Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
            `guidance_scale` is defined as `w` of equation 2. of [Imagen
            Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
            1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
            usually at the expense of lower image quality.
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            The number of images to generate per prompt.
        height (`int`, *optional*, defaults to `1024`):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to `1024`):
            The width in pixels of the generated image.
        eta (`float`, *optional*, defaults to 0.0):
            Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
            [`schedulers.DDIMScheduler`], will be ignored for others.
        generator (`np.random.Generator` or `List[np.random.Generator]`, *optional*):
            One or a list of [numpy generator(s)](https://numpy.org/doc/stable/reference/random/generator.html)
            to make generation deterministic.
        latents (`ms.Tensor`, *optional*):
            Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
            generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
            tensor will be generated by sampling using the supplied random `generator`.
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        prompt_attention_mask (`ms.Tensor`, *optional*): Pre-generated attention mask for text embeddings.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. For Sana this negative prompt should be "". If not
            provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.
        negative_prompt_attention_mask (`ms.Tensor`, *optional*):
            Pre-generated attention mask for negative text embeddings.
        output_type (`str`, *optional*, defaults to `"pil"`):
            The output format of the generated image. Choose between
            [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
        return_dict (`bool`, *optional*, defaults to `False`):
            Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
        clean_caption (`bool`, *optional*, defaults to `True`):
            Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to
            be installed. If the dependencies are not installed, the embeddings will be created from the raw
            prompt.
        use_resolution_binning (`bool` defaults to `True`):
            If set to `True`, the requested height and width are first mapped to the closest resolutions using
            `ASPECT_RATIO_1024_BIN`. After the produced latents are decoded into images, they are resized back to
            the requested resolution. Useful for generating non-square images.
        callback_on_step_end (`Callable`, *optional*):
            A function that is called at the end of each denoising step during inference. The function is called
            with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
            callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
            `callback_on_step_end_tensor_inputs`.
        callback_on_step_end_tensor_inputs (`List`, *optional*):
            The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
            will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
            `._callback_tensor_inputs` attribute of your pipeline class.
        max_sequence_length (`int` defaults to 300): Maximum sequence length to use with the `prompt`.
        complex_human_instruction (`List[str]`, *optional*):
            The complex human instruction used to enhance the prompt; see
            https://github.com/NVlabs/Sana/blob/main/configs/sana_app_config/Sana_1600M_app.yaml#L55.
        pag_scale (`float`, *optional*, defaults to 3.0):
            The scale factor for the perturbed attention guidance. If it is set to 0.0, the perturbed attention
            guidance will not be used.
        pag_adaptive_scale (`float`, *optional*, defaults to 0.0):
            The adaptive scale factor for the perturbed attention guidance. If it is set to 0.0, `pag_scale` is
            used.

    Examples:

    Returns:
        [`~pipelines.ImagePipelineOutput`] or `tuple`:
            If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
            returned where the first element is a list with the generated images
    """

    if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
        callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

    if use_resolution_binning:
        if self.transformer.config.sample_size == 64:
            aspect_ratio_bin = ASPECT_RATIO_2048_BIN
        elif self.transformer.config.sample_size == 32:
            aspect_ratio_bin = ASPECT_RATIO_1024_BIN
        elif self.transformer.config.sample_size == 16:
            aspect_ratio_bin = ASPECT_RATIO_512_BIN
        else:
            raise ValueError("Invalid sample size")
        orig_height, orig_width = height, width
        height, width = self.image_processor.classify_height_width_bin(height, width, ratios=aspect_ratio_bin)

    self.check_inputs(
        prompt,
        height,
        width,
        callback_on_step_end_tensor_inputs,
        negative_prompt,
        prompt_embeds,
        negative_prompt_embeds,
        prompt_attention_mask,
        negative_prompt_attention_mask,
    )

    self._pag_scale = pag_scale
    self._pag_adaptive_scale = pag_adaptive_scale
    self._guidance_scale = guidance_scale
    self._interrupt = False

    # 2. Default height and width to transformer
    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    # 3. Encode input prompt
    (
        prompt_embeds,
        prompt_attention_mask,
        negative_prompt_embeds,
        negative_prompt_attention_mask,
    ) = self.encode_prompt(
        prompt,
        self.do_classifier_free_guidance,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_images_per_prompt,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        clean_caption=clean_caption,
        max_sequence_length=max_sequence_length,
        complex_human_instruction=complex_human_instruction,
    )

    if self.do_perturbed_attention_guidance:
        prompt_embeds = self._prepare_perturbed_attention_guidance(
            prompt_embeds, negative_prompt_embeds, self.do_classifier_free_guidance
        )
        prompt_attention_mask = self._prepare_perturbed_attention_guidance(
            prompt_attention_mask, negative_prompt_attention_mask, self.do_classifier_free_guidance
        )
    elif self.do_classifier_free_guidance:
        prompt_embeds = ops.cat([negative_prompt_embeds, prompt_embeds], axis=0)
        prompt_attention_mask = ops.cat([negative_prompt_attention_mask, prompt_attention_mask], axis=0)

    # 4. Prepare timesteps
    timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, timesteps, sigmas)

    # 5. Prepare latents.
    latent_channels = self.transformer.config.in_channels
    latents = self.prepare_latents(
        batch_size * num_images_per_prompt,
        latent_channels,
        height,
        width,
        ms.float32,
        generator,
        latents,
    )
    if self.do_perturbed_attention_guidance:
        original_attn_proc = self.transformer.attn_processors
        self._set_pag_attn_processor(
            pag_applied_layers=self.pag_applied_layers,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
        )

    # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
    extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

    # 7. Denoising loop
    num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
    self._num_timesteps = len(timesteps)

    with self.progress_bar(total=num_inference_steps) as progress_bar:
        for i, t in enumerate(timesteps):
            if self.interrupt:
                continue

            # expand the latents if we are doing classifier free guidance, perturbed-attention guidance, or both
            latent_model_input = ops.cat([latents] * (prompt_embeds.shape[0] // latents.shape[0]))
            latent_model_input = latent_model_input.to(prompt_embeds.dtype)

            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
            timestep = t.broadcast_to((latent_model_input.shape[0],)).to(latents.dtype)

            # predict noise model_output
            noise_pred = self.transformer(
                latent_model_input,
                encoder_hidden_states=prompt_embeds,
                encoder_attention_mask=prompt_attention_mask,
                timestep=timestep,
                return_dict=False,
            )[0]
            noise_pred = noise_pred.float()

            # perform guidance
            if self.do_perturbed_attention_guidance:
                noise_pred = self._apply_perturbed_attention_guidance(
                    noise_pred, self.do_classifier_free_guidance, guidance_scale, t
                )
            elif self.do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            # compute previous image: x_t -> x_t-1
            latents_dtype = latents.dtype
            latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

            # call the callback, if provided
            if callback_on_step_end is not None:
                callback_kwargs = {}
                for k in callback_on_step_end_tensor_inputs:
                    callback_kwargs[k] = locals()[k]
                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                latents = callback_outputs.pop("latents", latents)
                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
                negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)

            if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                progress_bar.update()

    if output_type == "latent":
        image = latents
    else:
        latents = latents.to(self.vae.dtype)
        image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
        if use_resolution_binning:
            image = self.image_processor.resize_and_crop_tensor(image, orig_width, orig_height)

    if output_type != "latent":
        image = self.image_processor.postprocess(image, output_type=output_type)

    if self.do_perturbed_attention_guidance:
        self.transformer.set_attn_processor(original_attn_proc)

    if not return_dict:
        return (image,)

    return ImagePipelineOutput(images=image)
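As the loop above shows, `callback_on_step_end` receives the pipeline, the step index, the timestep, and a dict of the requested tensors, and any entries it returns are popped back into the loop. A minimal sketch (hypothetical usage; assumes `pipe` is a loaded SanaPAGPipeline):

def log_latent_stats(pipeline, step, timestep, callback_kwargs):
    # Inspect (or modify) the tensors requested via
    # `callback_on_step_end_tensor_inputs`; returned entries replace the
    # loop's local values.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latents std = {latents.std()}")
    return callback_kwargs

image = pipe(
    prompt="a watercolor fox in a misty forest",
    callback_on_step_end=log_latent_stats,
    callback_on_step_end_tensor_inputs=["latents"],
)[0][0]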

mindone.diffusers.SanaPAGPipeline.encode_prompt(prompt, do_classifier_free_guidance=True, negative_prompt='', num_images_per_prompt=1, prompt_embeds=None, negative_prompt_embeds=None, prompt_attention_mask=None, negative_prompt_attention_mask=None, clean_caption=False, max_sequence_length=300, complex_human_instruction=None)

Encodes the prompt into text encoder hidden states.

PARAMETER DESCRIPTION
prompt

prompt to be encoded

TYPE: `str` or `List[str]`, *optional*

negative_prompt

The prompt not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For Sana, this should be "".

TYPE: `str` or `List[str]`, *optional* DEFAULT: ''

do_classifier_free_guidance

whether to use classifier free guidance or not

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

num_images_per_prompt

number of images that should be generated per prompt

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

prompt_embeds

Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

negative_prompt_embeds

Pre-generated negative text embeddings. For Sana, this should be the embeddings of the "" string.

TYPE: `ms.Tensor`, *optional* DEFAULT: None

clean_caption

If True, the function will preprocess and clean the provided caption before encoding.

TYPE: `bool`, defaults to `False` DEFAULT: False

max_sequence_length

Maximum sequence length to use for the prompt.

TYPE: `int`, defaults to 300 DEFAULT: 300

complex_human_instruction

If complex_human_instruction is not empty, the function will use the complex human instruction for the prompt.

TYPE: `list[str]`, defaults to `complex_human_instruction` DEFAULT: None
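A hedged sketch of pre-computing embeddings with this method and reusing them across calls (assumes `pipe` is a loaded SanaPAGPipeline; to match `__call__`'s default results exactly, the same `complex_human_instruction` would also need to be passed here):

(
    prompt_embeds,
    prompt_attention_mask,
    negative_prompt_embeds,
    negative_prompt_attention_mask,
) = pipe.encode_prompt(
    "a watercolor fox in a misty forest",
    do_classifier_free_guidance=True,
)

# Reuse the cached embeddings; the pipeline skips re-encoding the prompt.
images = pipe(
    prompt_embeds=prompt_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
)[0]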

Source code in mindone/diffusers/pipelines/pag/pipeline_pag_sana.py
def encode_prompt(
    self,
    prompt: Union[str, List[str]],
    do_classifier_free_guidance: bool = True,
    negative_prompt: str = "",
    num_images_per_prompt: int = 1,
    prompt_embeds: Optional[ms.Tensor] = None,
    negative_prompt_embeds: Optional[ms.Tensor] = None,
    prompt_attention_mask: Optional[ms.Tensor] = None,
    negative_prompt_attention_mask: Optional[ms.Tensor] = None,
    clean_caption: bool = False,
    max_sequence_length: int = 300,
    complex_human_instruction: Optional[List[str]] = None,
):
    r"""
    Encodes the prompt into text encoder hidden states.

    Args:
        prompt (`str` or `List[str]`, *optional*):
            prompt to be encoded
        negative_prompt (`str` or `List[str]`, *optional*):
            The prompt not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds`
            instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For
            Sana, this should be "".
        do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
            whether to use classifier free guidance or not
        num_images_per_prompt (`int`, *optional*, defaults to 1):
            number of images that should be generated per prompt
        prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
            provided, text embeddings will be generated from `prompt` input argument.
        negative_prompt_embeds (`ms.Tensor`, *optional*):
            Pre-generated negative text embeddings. For Sana, this should be the embeddings of the "" string.
        clean_caption (`bool`, defaults to `False`):
            If `True`, the function will preprocess and clean the provided caption before encoding.
        max_sequence_length (`int`, defaults to 300): Maximum sequence length to use for the prompt.
        complex_human_instruction (`list[str]`, defaults to `complex_human_instruction`):
            If `complex_human_instruction` is not empty, the function will use the complex human instruction for
            the prompt.
    """
    if prompt is not None and isinstance(prompt, str):
        batch_size = 1
    elif prompt is not None and isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        batch_size = prompt_embeds.shape[0]

    self.tokenizer.padding_side = "right"

    # See Section 3.1. of the paper.
    max_length = max_sequence_length
    select_index = [0] + list(range(-max_length + 1, 0))

    if prompt_embeds is None:
        prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)

        # prepare complex human instruction
        if not complex_human_instruction:
            max_length_all = max_length
        else:
            chi_prompt = "\n".join(complex_human_instruction)
            prompt = [chi_prompt + p for p in prompt]
            num_chi_prompt_tokens = len(self.tokenizer.encode(chi_prompt))
            max_length_all = num_chi_prompt_tokens + max_length - 2

        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=max_length_all,
            truncation=True,
            add_special_tokens=True,
            return_tensors="np",
        )
        text_input_ids = ms.Tensor.from_numpy(text_inputs.input_ids)

        prompt_attention_mask = ms.Tensor.from_numpy(text_inputs.attention_mask)

        prompt_embeds = self.text_encoder(text_input_ids, attention_mask=prompt_attention_mask)
        prompt_embeds = prompt_embeds[0][:, select_index]
        prompt_attention_mask = prompt_attention_mask[:, select_index]

    if self.transformer is not None:
        dtype = self.transformer.dtype
    elif self.text_encoder is not None:
        dtype = self.text_encoder.dtype
    else:
        dtype = None

    prompt_embeds = prompt_embeds.to(dtype=dtype)

    bs_embed, seq_len, _ = prompt_embeds.shape
    # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
    prompt_embeds = prompt_embeds.tile((1, num_images_per_prompt, 1))
    prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)
    prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1)
    prompt_attention_mask = prompt_attention_mask.tile((num_images_per_prompt, 1))

    # get unconditional embeddings for classifier free guidance
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        uncond_tokens = [negative_prompt] * batch_size if isinstance(negative_prompt, str) else negative_prompt
        uncond_tokens = self._text_preprocessing(uncond_tokens, clean_caption=clean_caption)
        max_length = prompt_embeds.shape[1]
        uncond_input = self.tokenizer(
            uncond_tokens,
            padding="max_length",
            max_length=max_length,
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="np",
        )
        negative_prompt_attention_mask = ms.Tensor.from_numpy(uncond_input.attention_mask)

        negative_prompt_embeds = self.text_encoder(
            ms.Tensor.from_numpy(uncond_input.input_ids), attention_mask=negative_prompt_attention_mask
        )
        negative_prompt_embeds = negative_prompt_embeds[0]

    if do_classifier_free_guidance:
        # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        seq_len = negative_prompt_embeds.shape[1]

        negative_prompt_embeds = negative_prompt_embeds.to(dtype=dtype)

        negative_prompt_embeds = negative_prompt_embeds.tile((1, num_images_per_prompt, 1))
        negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

        negative_prompt_attention_mask = negative_prompt_attention_mask.view(bs_embed, -1)
        negative_prompt_attention_mask = negative_prompt_attention_mask.tile((num_images_per_prompt, 1))
    else:
        negative_prompt_embeds = None
        negative_prompt_attention_mask = None

    return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask
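The `select_index` computed above keeps the first (BOS) token plus the trailing `max_length - 1` positions, so the prepended complex-human-instruction tokens are dropped from the returned embeddings. A small illustration in plain Python (not pipeline API):

max_length = 300
select_index = [0] + list(range(-max_length + 1, 0))

# For a 400-token padded sequence (instruction prefix + prompt), this keeps
# position 0 and positions 101..399 -- exactly `max_length` tokens in total.
sequence = list(range(400))
selected = [sequence[i] for i in select_index]
assert len(selected) == max_length
assert selected[:2] == [0, 101]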

mindone.diffusers.pipelines.sana.pipeline_output.SanaPipelineOutput dataclass

Bases: BaseOutput

Output class for Sana pipelines.

Source code in mindone/diffusers/pipelines/sana/pipeline_output.py
@dataclass
class SanaPipelineOutput(BaseOutput):
    """
    Output class for Sana pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of denoised PIL images of length `batch_size` or a numpy array of shape `(batch_size, height, width,
            num_channels)`. PIL images or numpy arrays represent the denoised images of the diffusion pipeline.
    """

    images: Union[List[PIL.Image.Image], np.ndarray]
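A hedged sketch of consuming this output class (assumes a loaded Sana pipeline invoked with `return_dict=True`):

output = pipe("a cat sleeping on a sunny windowsill", return_dict=True)
first_image = output.images[0]  # PIL.Image.Image when output_type="pil"
first_image.save("sana.png")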