SANA (Efficient High-Resolution Image Synthesis)¶
Efficient DiT from NVlabs/MIT Han Lab. 600M-4.8B params with competitive quality at 1024-4096px. Uses linear attention O(n), DC-AE 32× compression, and Gemma-2-2B text encoder. ICLR 2025 Oral.
Architecture — Full Detail¶
Model Variants¶
| Variant | Params | Depth | Hidden size | Heads | FFN width (2.5× hidden) |
|---|---|---|---|---|---|
| Sana-0.6B | 590M | 28 blocks | 1152 | 16 | 2880 |
| Sana-1.6B | 1604M | 20 blocks | 2240 | 20 | 5600 |
| Sana-1.5 4.8B | 4800M | 60 blocks | 2240 | 20 | 5600 |
0.6B = deeper but narrower. 1.6B = shallower but wider. 4.8B scales ONLY depth (20→60), width same as 1.6B.
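As a quick sanity check on the table, the FFN column is 2.5 times the hidden size for every variant. A minimal sketch (the `VARIANTS` dict and `ffn_width` helper are illustrative names, not taken from the Sana codebase):

```python
# Variant hyperparameters copied from the table above.
VARIANTS = {
    "Sana-0.6B": {"depth": 28, "hidden": 1152, "heads": 16},
    "Sana-1.6B": {"depth": 20, "hidden": 2240, "heads": 20},
    "Sana-1.5 4.8B": {"depth": 60, "hidden": 2240, "heads": 20},
}

MLP_RATIO = 2.5  # FFN expansion ratio stated in the table header


def ffn_width(hidden: int, ratio: float = MLP_RATIO) -> int:
    """Inner width of the Mix-FFN for a given hidden size."""
    return int(hidden * ratio)


for name, cfg in VARIANTS.items():
    print(f"{name}: depth={cfg['depth']}, hidden={cfg['hidden']}, ffn={ffn_width(cfg['hidden'])}")
```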
SanaBlock Structure (AdaLN-Zero)¶
Input x → LayerNorm → modulate(shift_msa, scale_msa) → Linear Self-Attention (LiteLA) → × gate_msa → + residual
→ Cross-Attention (standard, with text; not modulated) → + residual
→ LayerNorm → modulate(shift_mlp, scale_mlp) → Mix-FFN (GLUMBConv) → × gate_mlp → + residual
Output x (6 modulation params per block via scale_shift_table: shift, scale, and gate for self-attention and for Mix-FFN)
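The modulation step can be illustrated on a single token vector. This is a minimal pure-Python sketch, not the Sana implementation: `modulated_sublayer` is a hypothetical helper, and the multiplicative `gate` on the residual branch is assumed from the AdaLN-Zero family of blocks.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learnable affine parameters over a 1-D token vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def modulate(x, shift, scale):
    """AdaLN modulation: x * (1 + scale) + shift, element-wise."""
    return [v * (1.0 + s) + b for v, s, b in zip(x, scale, shift)]


def modulated_sublayer(x, shift, scale, gate, sublayer):
    """One residual stage: x + gate * sublayer(modulate(LayerNorm(x)))."""
    h = sublayer(modulate(layer_norm(x), shift, scale))
    return [xi + g * hi for xi, g, hi in zip(x, gate, h)]


# Toy usage: a stand-in sublayer that doubles its input.
x = [1.0, 2.0, 3.0, 4.0]
zeros = [0.0] * 4
out = modulated_sublayer(x, zeros, zeros, zeros, lambda h: [2 * v for v in h])
print(out)  # zero gate -> the stage is an identity mapping
```

With every gate at zero the stage passes its input through unchanged, which is why zero-initialized modulation gives a stable starting point.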
Linear Attention (LiteLA with ReLU Kernel)¶
# Standard quadratic: O(N^2)
# Attention = softmax(QK^T / sqrt(d)) * V
# SANA linear: O(N * d^2)
# phi(x) = ReLU(x)
# Shared terms: S = sum_j phi(K_j)^T * V_j (shape d×d, computed ONCE)
# Z = sum_j phi(K_j)^T (shape d×1, computed ONCE)
# Output_i = phi(Q_i) * S / (phi(Q_i) * Z + eps)
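The formulas above can be checked with a small pure-Python sketch. It illustrates the math only (single head, no batching, no Triton fusion); the function names and toy inputs are made up for the example. `quadratic_reference` builds the explicit N×N weights so the two paths can be compared.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]


def linear_attention(Q, K, V, eps=1e-15):
    """ReLU linear attention. Q, K: N x d lists; V: N x d_v list."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]  # sum_j phi(K_j)^T V_j, shape d x d_v
    Z = [0.0] * d                        # sum_j phi(K_j)^T, shape d
    for k, v in zip(K, V):               # single pass over keys/values
        pk = relu(k)
        for a in range(d):
            Z[a] += pk[a]
            for b in range(d_v):
                S[a][b] += pk[a] * v[b]
    out = []
    for q in Q:                          # every query reuses the same S and Z
        pq = relu(q)
        denom = sum(pq[a] * Z[a] for a in range(d)) + eps
        out.append([sum(pq[a] * S[a][b] for a in range(d)) / denom for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V, eps=1e-15):
    """Same result via the explicit N x N weight matrix, for comparison only."""
    out = []
    for q in Q:
        pq = relu(q)
        w = [sum(a * b for a, b in zip(pq, relu(k))) for k in K]
        denom = sum(w) + eps
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / denom for c in range(len(V[0]))])
    return out


Q = [[1.0, -0.5], [0.3, 0.8], [-1.0, 2.0]]
K = [[0.2, 1.0], [1.5, -0.3], [0.7, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fast, slow = linear_attention(Q, K, V), quadratic_reference(Q, K, V)
print(fast)
```

Because `S` and `Z` do not depend on the query, the cost grows linearly with the number of tokens instead of quadratically.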
Trade-off: linear attention alone degrades quality. The Mix-FFN compensates with a 3×3 depthwise convolution that captures the local spatial information the ReLU kernel loses (it has no softmax locality bias).
Triton kernel fusion: ReLU activation, precision conversions, padding, and division are fused into the matmul kernels, giving a ~10% speedup.
Position encoding: "NoPE" (No Positional Embeddings) — 3×3 depthwise conv in Mix-FFN implicitly encodes position. Alternatively supports RoPE theta=10000, axes_dim=[0,16,16].
Mix-FFN (GLUMBConv)¶
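Replaces the standard MLP with an inverted-bottleneck block: 1×1 expansion conv → 3×3 depthwise conv → GLU gating → 1×1 projection conv.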
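A minimal pure-Python sketch of such a block on a `[C][H][W]` nested list, assuming SiLU activations, no biases, and zero padding. The weights and the `glumbconv` helper are illustrative, not the Sana implementation.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def conv1x1(x, w):
    """Pointwise conv. x: [C_in][H][W], w: [C_out][C_in] -> [C_out][H][W]."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(w[o][i] * x[i][r][c] for i in range(len(x))) for c in range(W)]
             for r in range(H)] for o in range(len(w))]


def depthwise3x3(x, k):
    """Per-channel 3x3 conv with zero padding. k: [C][3][3]."""
    H, W = len(x[0]), len(x[0][0])
    out = []
    for ch, kern in zip(x, k):
        plane = [[0.0] * W for _ in range(H)]
        for r in range(H):
            for c in range(W):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < H and 0 <= cc < W:
                            acc += kern[dr + 1][dc + 1] * ch[rr][cc]
                plane[r][c] = acc
        out.append(plane)
    return out


def glumbconv(x, w_expand, k_dw, w_project):
    """1x1 expand -> SiLU -> 3x3 depthwise -> GLU gate -> 1x1 project."""
    h = conv1x1(x, w_expand)                                  # C -> 2 * mid channels
    h = [[[silu(v) for v in row] for row in ch] for ch in h]
    h = depthwise3x3(h, k_dw)                                 # local 3x3 spatial mixing
    mid = len(h) // 2
    value, gate = h[:mid], h[mid:]                            # split into value / gate halves
    gated = [[[v * silu(g) for v, g in zip(vr, gr)] for vr, gr in zip(vc, gc)]
             for vc, gc in zip(value, gate)]
    return conv1x1(gated, w_project)                          # mid -> C channels


C, MID, H, W = 2, 2, 5, 5
x = [[[0.0] * W for _ in range(H)] for _ in range(C)]
x[0][2][2] = 1.0  # single active pixel at the centre
w_expand = [[0.5, 0.1], [0.2, 0.3], [0.4, 0.6], [0.7, 0.2]]
k_dw = [[[1.0 / 9] * 3 for _ in range(3)] for _ in range(2 * MID)]
w_project = [[1.0, 0.5], [0.5, 1.0]]
y = glumbconv(x, w_expand, k_dw, w_project)
```

In this toy run a single active pixel only influences its 3×3 neighbourhood, which is the kind of local mixing the block contributes.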
The 3×3 depthwise conv = key to making linear attention work. Provides local receptive field that ReLU linear attention lacks.
DC-AE (Deep Compression Autoencoder) — F32C32P1¶
32× spatial compression, 32 latent channels, patch size 1.
| Resolution | SD/FLUX (F8, P2) tokens | SANA (F32, P1) tokens | Reduction |
|---|---|---|---|
| 512×512 | 1024 | 256 (16×16) | 4× |
| 1024×1024 | 4096 | 1024 (32×32) | 4× |
| 2048×2048 | 16384 | 4096 (64×64) | 4× |
| 4096×4096 | 65536 | 16384 (128×128) | 4× |
4× fewer tokens + O(n) linear attention = orders of magnitude faster at high res.
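The token counts in the table follow from dividing the image side by the autoencoder factor times the patch size and squaring. A short sketch (`latent_tokens` is an illustrative helper; square images are assumed):

```python
def latent_tokens(resolution: int, ae_factor: int, patch_size: int) -> int:
    """Number of transformer tokens for a square image."""
    side = resolution // (ae_factor * patch_size)
    return side * side


for res in (512, 1024, 2048, 4096):
    baseline = latent_tokens(res, ae_factor=8, patch_size=2)  # SD/FLUX-style F8, P2
    sana = latent_tokens(res, ae_factor=32, patch_size=1)     # SANA F32, P1
    print(f"{res}px: {baseline} vs {sana} tokens ({baseline // sana}x fewer)")
```

The ratio is constant because the effective downsampling per side is 16 for F8/P2 and 32 for F32/P1, and (32/16)² = 4.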
Reconstruction quality (ImageNet): rFID 0.34, PSNR 29.29, SSIM 0.84, LPIPS 0.05.
Tiling supported: pipe.vae.enable_tiling(tile_sample_min_height=1024, tile_sample_min_width=1024) — enables 4K within 22 GB VRAM.
Two versions: dc-ae-f32c32-sana-1.0 and dc-ae-f32c32-sana-1.1 (improved), plus a lite variant, dc-ae-lite-f32c32 (faster/smaller).
Text Encoder: Gemma-2-2B-IT¶
Decoder-only LLM (not T5). 6× faster than T5-XXL. Max 300 tokens.
Critical: decoder-only outputs have variance orders of magnitude larger than T5. Solution: apply RMSNorm after the encoder, followed by a learnable scale initialized to 0.01 (y_norm: true, y_norm_scale: 0.01).
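A minimal sketch of what this normalization does to one embedding vector. The `rms_norm` helper and the toy values are illustrative; in training the weight is learnable and only starts at 0.01.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: divide by the root-mean-square, then apply a per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


Y_NORM_SCALE = 0.01  # matches y_norm_scale in the config

# A toy "text embedding" with a large dynamic range (made-up values).
embedding = [120.0, -340.0, 15.0, 980.0]
weight = [Y_NORM_SCALE] * len(embedding)  # shown at its initial value
normed = rms_norm(embedding, weight)
out_rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print(normed, out_rms)
```

Whatever the magnitude of the input, the output starts with a root-mean-square close to 0.01, so the text features enter cross-attention at a small, controlled scale.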
Complex Human Instruction (CHI): leverages Gemma's in-context learning → +2.2 GenEval points.
Training¶
Loss & Scheduler¶
Flow matching with velocity prediction: the network v_theta(x_t, t) is trained to regress the target epsilon - x_0, where x_t = (1 - t) * x_0 + t * epsilon. Timestep sampling: logit-normal (mean=0.0, std=1.0). Flow shift: 3.0.
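A hedged sketch of how one training example could be built under this objective. The linear interpolation x_t = (1 - t) x_0 + t epsilon is the standard rectified-flow path consistent with the velocity target. The shift formula is the SD3-style one, and applying it to the sampled training timestep is an assumption here. All helper names are illustrative.

```python
import math
import random


def sample_timestep(rng, mean=0.0, std=1.0):
    """Logit-normal sampling: t = sigmoid(n), n ~ N(mean, std)."""
    n = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-n))


def shift_timestep(t, shift=3.0):
    """Push t toward the noisy end (SD3-style formula; an assumption here)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def training_pair(x0, eps, t):
    """Return the noisy input x_t and the velocity target eps - x0."""
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
    target = [e - a for a, e in zip(x0, eps)]
    return x_t, target


def flow_matching_loss(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
t = shift_timestep(sample_timestep(rng))
x_t, target = training_pair(x0, eps, t)
print(t, flow_matching_loss(target, target))  # a perfect prediction gives zero loss
```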
Optimizer: CAME¶
optimizer: CAMEWrapper
lr: 1e-4
betas: [0.9, 0.999, 0.9999]
epsilon: [1e-30, 1e-16]
weight_decay: 0.0
grad_clip: 0.1
warmup: 2000 steps, constant after
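The warmup and clipping entries can be read as the following schedule. This sketch assumes a linear warmup shape and norm-based gradient clipping, neither of which the config fragment states explicitly. `lr_at` and `clip_by_global_norm` are illustrative helpers.

```python
import math

BASE_LR = 1e-4
WARMUP_STEPS = 2000
GRAD_CLIP = 0.1


def lr_at(step: int, base_lr: float = BASE_LR, warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to base_lr, then constant."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps


def clip_by_global_norm(grads, max_norm: float = GRAD_CLIP):
    """Scale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


for step in (0, 500, 1000, 2000, 10000):
    print(step, lr_at(step))
print(clip_by_global_norm([3.0, 4.0]))
```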
SANA 1.5 uses CAME-8bit — block-wise 8-bit first-order moments, 32-bit second-order. 25% memory reduction vs AdamW.
Resolution Schedule¶
Skip 256px entirely. Start at 512px → finetune to 1024 → 2K → 4K.
Multi-Caption Labeling¶
4 VLMs generate captions (VILA-3B, VILA-13B, InternVL2-8B, InternVL2-26B). CLIP-score sampler selects per iteration.
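One way to picture the CLIP-score sampler is a softmax over the per-caption scores with a temperature. The scores, captions, and temperature values below are made up for illustration, and the helper names are hypothetical.

```python
import math
import random


def caption_probabilities(clip_scores, temperature=1.0):
    """Softmax over CLIP scores; lower temperature favours the top caption."""
    m = max(clip_scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in clip_scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_caption(captions, clip_scores, rng, temperature=1.0):
    """Draw one caption for this training iteration."""
    probs = caption_probabilities(clip_scores, temperature)
    return rng.choices(captions, weights=probs, k=1)[0]


# Made-up captions and scores, one per VLM.
captions = ["vila-3b caption", "vila-13b caption", "internvl2-8b caption", "internvl2-26b caption"]
scores = [26.1, 27.4, 25.8, 28.0]
rng = random.Random(0)
probs = caption_probabilities(scores, temperature=0.5)
print(probs, sample_caption(captions, scores, rng, temperature=0.5))
```

Sampling rather than always taking the top caption keeps caption diversity while still favouring captions that match the image well.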
SFT: 3M samples filtered by CLIP score > 25 from the 50M pre-training set.
SANA 1.5 Key Improvements¶
- Depth-growth paradigm (1.6B → 4.8B): remove last 2 blocks of trained 1.6B → add 40 new blocks with Partial Preservation Init (identity mappings) → 60% fewer training steps
- QK-normalization: RMSNorm on Q,K for stable large-model training
- Depth pruning: block importance metric → prune middle blocks, keep head/tail → quick recovery with ~100 fine-tune steps
- Inference-time scaling: generate N candidates, select best via VILA-Judge (fine-tuned NVILA-2B). GenEval: 0.81 → 0.96 with 2048 candidates. 1.6B + scaling outperforms 4.8B without
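The inference-time scaling item amounts to best-of-N selection. The sketch below uses stand-in `generate` and `judge` callables: a real setup would call the diffusion model and VILA-Judge, and the scalar score used here is a simplification of however the judge actually ranks candidates.

```python
import random


def best_of_n(prompt, generate, judge, n, rng):
    """Generate n candidates and return the one the judge scores highest."""
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [judge(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Stand-ins: "images" are floats and the judge prefers values near 1.0.
def toy_generate(prompt, rng):
    return rng.random()


def toy_judge(prompt, image):
    return -abs(1.0 - image)


_, score_1 = best_of_n("a red cube", toy_generate, toy_judge, 1, random.Random(0))
_, score_64 = best_of_n("a red cube", toy_generate, toy_judge, 64, random.Random(0))
print(score_1, score_64)
```

Cost grows linearly with N, since every candidate is a full generation plus a judge call.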
Benchmarks¶
| Model | Params | FID↓ | CLIP↑ | GenEval↑ | Speed (A100) |
|---|---|---|---|---|---|
| FLUX-dev | 12.0B | 10.15 | 27.47 | 0.67 | 0.04 img/s |
| SD3-medium | 2.0B | 11.92 | 27.83 | 0.62 | 0.28 img/s |
| Sana-0.6B | 0.6B | 5.81 | 28.36 | 0.64 | 1.7 img/s |
| Sana-1.6B | 1.6B | 5.76 | 28.67 | 0.66 | 1.0 img/s |
| Sana-1.5 1.6B | 1.6B | 5.70 | 29.12 | 0.82 | 1.0 img/s |
| Sana-1.5 4.8B | 4.8B | 5.99 | 29.23 | 0.81 | 0.26 img/s |
Sana-1.6B = 23× faster than FLUX-dev. At 4K: 106× faster (9.6s vs 469s).
SANA-Sprint (Distilled, 1-4 Steps)¶
Hybrid distillation: sCM (continuous consistency) + LADD (latent adversarial).
| Steps | FID | GenEval | Latency (H100) |
|---|---|---|---|
| 1 | 7.04 | 0.72 | 0.1s |
| 2 | 6.76 | — | 0.24s |
| 4 | 6.48 | 0.76 | 0.32s |
Outperforms FLUX-schnell (7.94 FID) while 10× faster. ICCV 2025 Highlight.
SANA-Video¶
Block Causal Linear Attention + Causal Mix-FFN for video. 2B params, 720p, up to 1 min, 16 FPS. 52× faster than Wan-2.1-14B (36s vs 1897s for 5s clip). VBench: 84.05 vs Wan: 83.73.
Key for Temporal Tiling: SANA-Video's causal attention = same mechanism needed for tiles-as-frames.
VRAM¶
| Config | VRAM |
|---|---|
| 0.6B 1024px bf16 | ~16 GB |
| 1.6B 1024px bf16 | ~16-24 GB |
| 4K with VAE tiling | ~22 GB |
| 4K + 4-bit quant + offload | < 8 GB |
| W8A8 quantized 1024px | Very low, 0.37s on 4090 |
Fine-Tuning / LoRA¶
Official support via diffusers train_dreambooth_lora_sana.py.
LoRA targets: attn.to_k, attn.to_q, attn.to_v, attn.to_out.0 + optionally FFN/MLP.
Settings: LR 1e-4, 500 steps, batch 1, grad accum 4, bf16, 3-5 images. Requires peft >= 0.14.0.
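For a sense of scale, the settings above imply the following sample budget. The dict keys mirror common diffusers training-script flags but are only illustrative here, and "steps" is assumed to mean optimizer updates (after gradient accumulation).

```python
SETTINGS = {
    "learning_rate": 1e-4,
    "max_train_steps": 500,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "mixed_precision": "bf16",
}


def effective_batch_size(cfg):
    """Samples contributing to each optimizer update."""
    return cfg["train_batch_size"] * cfg["gradient_accumulation_steps"]


def samples_seen(cfg):
    """Total training samples processed, assuming steps are optimizer updates."""
    return cfg["max_train_steps"] * effective_batch_size(cfg)


for n_images in (3, 5):
    passes = samples_seen(SETTINGS) / n_images
    print(f"{n_images} images: ~{passes:.0f} passes per image")
```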
ControlNet also supported — ControlNet-Transformer architecture for SANA backbone.
License¶
Code: Apache 2.0. Weights: NSCL v2-custom (check specific terms for commercial use).
Key Links¶
- GitHub: github.com/NVlabs/Sana
- HF: huggingface.co/Efficient-Large-Model/
- Papers: arxiv.org/abs/2410.10629 (SANA), 2501.18427 (1.5), 2503.09641 (Sprint)
- SANA-Video: arxiv.org/abs/2509.24695
- DC-AE: github.com/mit-han-lab/efficientvit
- Training framework: happyin-research/sana-fm/ (local)