Diffusion Inference Acceleration

★★★★★ Intermediate

Techniques for accelerating diffusion model inference with little or no quality loss. Covers spectral forecasting, KV cache compression and quantization, weight quantization, and sampling optimization.

Spectrum - Spectral Forecasting

Accelerates diffusion sampling by 3.5-4.8x through frequency-domain analysis. Instead of running the full denoiser at every timestep, Spectrum identifies which frequency components actually change between adjacent steps and only recomputes those.

How It Works

1. Analyze the frequency spectrum of the denoiser output at each step via FFT
2. Identify static frequencies - low-frequency components stabilize early in the denoising process, and high-frequency components follow in later steps
3. Forecast static components from the previous step and run the model only on the changing frequencies
4. Reconstruct the full output by combining the forecasted and computed components

The key insight: during diffusion sampling, low-frequency structure (composition, color) settles first, while high-frequency detail (texture, edges) continues evolving. But even high-frequency components become predictable in later steps.
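
The steps above can be illustrated with a toy example. This is a minimal sketch on 1-D signals using a naive DFT, not the Spectrum implementation: the function names, the relative-change test, and the two synthetic "denoiser outputs" are all assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse of dft(); keeps the real part because the toy signals are real."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def static_mask(prev_spec, curr_spec, threshold=0.1):
    """True for bins whose relative change between two steps is below threshold."""
    mask = []
    for p, c in zip(prev_spec, curr_spec):
        scale = max(abs(p), abs(c), 1e-9)  # floor avoids dividing by ~0 in empty bins
        mask.append(abs(c - p) / scale < threshold)
    return mask

def combine(forecast_spec, computed_spec, mask):
    """Take the forecast where a bin is static and the computed value elsewhere."""
    return [f if m else c for f, c, m in zip(forecast_spec, computed_spec, mask)]

# Synthetic outputs at two adjacent steps: frequency 1 is unchanged, frequency 5 grows.
n = 16
prev_out = [math.cos(2 * math.pi * t / n) + 0.5 * math.cos(2 * math.pi * 5 * t / n)
            for t in range(n)]
curr_out = [math.cos(2 * math.pi * t / n) + 0.8 * math.cos(2 * math.pi * 5 * t / n)
            for t in range(n)]

prev_spec, curr_spec = dft(prev_out), dft(curr_out)
mask = static_mask(prev_spec, curr_spec)
changing_bins = [k for k, is_static in enumerate(mask) if not is_static]
rebuilt = idft(combine(prev_spec, curr_spec, mask))

print(changing_bins)  # [5, 11]: frequency 5 and its mirror bin
```

Only the bins flagged as changing need fresh values; every other bin is carried over from the previous step, and `rebuilt` still matches `curr_out` to within floating-point error.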

Performance

Model	Speedup	Quality Impact
FLUX.1	3.5-4.79x	No measurable degradation
HunyuanVideo + FLUX.1	3.5x	No measurable degradation
Wan 2.1-14B	4.67x	No measurable degradation
SDXL	Supported	No measurable degradation

ComfyUI Integration

Available as ComfyUI-Spectrum-Proper nodes. Supports FLUX, WAN, and SDXL architectures. Drop-in replacement for standard sampling - no workflow restructuring needed.

# Conceptual usage (diffusers): illustrates the idea, not a verified API
from spectrum import SpectralForecaster

# `pipe` is assumed to be an already-loaded diffusers pipeline
forecaster = SpectralForecaster(
    model=pipe.transformer,
    threshold=0.1,  # frequency change threshold
)

# During the sampling loop, the forecaster decides at each step
# which frequency bands need a full model evaluation

TriAttention - KV Cache Compression

Compresses KV cache memory by 10.7x through trigonometric representation of attention matrices. Primarily targets LLM inference but applicable to any transformer with KV caching.

Mechanism

Standard KV cache stores full K and V tensors for all previous tokens. TriAttention represents these via trigonometric decomposition:

Encode K/V vectors into compact trigonometric coefficients
Reconstruct attention weights from coefficients on-the-fly
Result: 10.7x memory reduction with a 2.5x generation speedup
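
To put the 10.7x figure in context, the size of an uncompressed KV cache can be estimated from the model shape. The sketch below uses a hypothetical shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) that is not taken from any specific model. It applies the quoted reduction factor directly and does not model the trigonometric encoding itself.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Uncompressed KV cache size: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical model shape, for illustration only.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
compressed = full / 10.7  # reduction factor quoted above

GIB = 1024 ** 3
print(f"full: {full / GIB:.1f} GiB, compressed: {compressed / GIB:.2f} GiB")
# full: 16.0 GiB, compressed: 1.50 GiB
```

Because the cache grows linearly with sequence length, the absolute saving is largest for long contexts.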

Practical Integration

Available as a vLLM plugin. For diffusion transformers (DiT, MMDiT), KV cache compression matters most during:

Tiled inference with cross-tile attention (see Temporal Tiling)
Long-context conditioning (multiple reference images)
Video generation with temporal attention

TurboQuant - Extreme KV Cache Quantization

Compresses KV cache to 3 bits per element without model retraining. Uses random rotation of vectors + scalar quantization.

Algorithm

1. Apply random orthogonal rotation R to K/V vectors
2. Quantize rotated vectors to 3-bit scalars (table lookup)
3. At attention time: dequantize, apply R^T, compute attention normally

~5-6x memory compression at 99.5% attention fidelity.
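
The three steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the TurboQuant implementation: it uses a uniform 8-level codebook scaled per vector (a real implementation may use a different codebook), builds the rotation by Gram-Schmidt, and all function names are hypothetical.

```python
import math
import random

def random_orthogonal(dim, rng):
    """Random orthogonal matrix: Gram-Schmidt on Gaussian vectors (rows orthonormal)."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def quantize(v, levels=8):
    """Uniform scalar quantization; 8 levels means 3 bits per element."""
    scale = max(abs(a) for a in v) or 1.0
    step = 2 * scale / (levels - 1)
    return [round((a + scale) / step) for a in v], scale

def dequantize(codes, scale, levels=8):
    step = 2 * scale / (levels - 1)
    table = [-scale + i * step for i in range(levels)]  # lookup table of levels
    return [table[c] for c in codes]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

rng = random.Random(0)
dim = 32
R = random_orthogonal(dim, rng)
k = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for one K vector

codes, scale = quantize(matvec(R, k))                    # steps 1-2: rotate, quantize
k_hat = matvec(transpose(R), dequantize(codes, scale))   # step 3: dequantize, apply R^T

print(f"cosine similarity after round trip: {cosine(k, k_hat):.3f}")
```

Because R is orthogonal, R^T undoes the rotation exactly, so the only error left after the round trip comes from the 3-bit quantization.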

Implementation Status

# llama.cpp fork (working)
git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant && make -j
./llama-server -m model.gguf --cache-type-k turbo3 --cache-type-v turbo3

Runtime	Status
llama.cpp fork	Working on CPU; all tests pass
llama.cpp mainline	Merge pending
vLLM	Feature request open
Apple Silicon	4.6x compression, matches q8_0 speed

Practical Notes

Best gains on long contexts (128K+); at 4K context the difference is minimal
Base weight quantization affects quality: Q8_0 and above gives the best results, while Q4_K_M degrades more
The QJL correction (second stage) can worsen quality with standard attention; use the TurboQuant_mse variant instead
The method is simple (a rotation followed by table-lookup quantization), which is why implementations appeared within hours of publication

Nunchaku - Weight Quantization for DiT

NVFP4 and INT8 quantization specifically optimized for diffusion transformers.

FLUX Klein 9B Quantization

# Nunchaku quantized Klein inference
# 40-55% VRAM reduction vs bf16
from nunchaku import load_quantized_model

model = load_quantized_model(
    "flux2-klein-9b-nunchaku",
    dtype="nvfp4"  # or "int8"
)

Precision	VRAM (9B)	Quality	Speed
bf16	~18 GB	Baseline	1.0x
FP8	~10 GB	Near-identical	~1.1x
NVFP4	~8 GB	Slight softening	~1.3x
INT8 (W8A8)	~9 GB	Near-identical	~1.2x
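
As a rough cross-check of the VRAM column, weight storage alone can be estimated as parameters times bits per weight. The sketch below is a lower bound under simplifying assumptions: it ignores activations, quantization scale factors, and any layers kept at higher precision, which is presumably why the measured figures for the quantized rows are higher.

```python
# Nominal bits per weight; block scale factors for NVFP4 are ignored here.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int8": 8, "nvfp4": 4}

def weight_gb(num_params, bits):
    """Weight-only storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gb(9e9, bits):.1f} GB")
# bf16: 18.0 GB
# fp8: 9.0 GB
# int8: 9.0 GB
# nvfp4: 4.5 GB
```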

FP8 Klein maintains desirable grain structure (16mm film look). Non-FP8 produces cleaner output (35mm film character).

Sampling Optimization

Step Reduction for Distilled Models

Distilled models (FLUX Klein distilled, SANA-Sprint, FLUX.1-schnell) are tuned for a small, fixed step count:

Model	Design Steps	Usable Range	Quality Cliff
Klein 9B distilled	4	4-8	>8 degrades
SANA-Sprint	1-4	1-4	N/A (designed for 1-step)
FLUX.1-schnell	4	4-8	>8 degrades

Using 50+ steps on a 4-step distilled model wastes compute and can actively degrade quality.
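
One way to guard against this in a pipeline is to clamp the requested step count to the usable range. A minimal sketch, with the ranges copied from the table above; the helper name and dictionary are hypothetical, not part of any library.

```python
# Usable step ranges (min, max) from the table above.
USABLE_STEPS = {
    "Klein 9B distilled": (4, 8),
    "SANA-Sprint": (1, 4),
    "FLUX.1-schnell": (4, 8),
}

def clamp_steps(model, requested):
    """Clamp a requested step count into the model's usable range."""
    lo, hi = USABLE_STEPS[model]
    return max(lo, min(hi, requested))

print(clamp_steps("Klein 9B distilled", 50))  # 8
print(clamp_steps("SANA-Sprint", 2))          # 2
```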

Sampler Compute Efficiency

Different samplers have different compute cost per step:

Sampler	Compute/Step	Notes
euler	1x	Baseline, most stable
res_2s	~2x	Effectively doubles work per step
dpmpp_2m	~1x	Multistep: reuses the previous step's output; good quality/speed balance
euler_ancestral	~1x	Adds stochasticity

res_2s at 4 steps costs roughly the same compute as euler at 8 steps. It is useful for fixing anatomy issues without raising the nominal step count.
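
The comparison comes down to counting model evaluations. A minimal sketch, using the per-step costs of the euler and res_2s rows from the table above; the helper name is hypothetical.

```python
# Model evaluations per step, from the euler and res_2s rows of the table.
EVALS_PER_STEP = {"euler": 1, "res_2s": 2}

def total_evals(sampler, steps):
    """Total denoiser calls for a full sampling run."""
    return EVALS_PER_STEP[sampler] * steps

print(total_evals("res_2s", 4))  # 8
print(total_evals("euler", 8))   # 8
```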

Combined Acceleration Stack

For maximum throughput:

1. Spectrum spectral forecasting     (3.5-4.8x speedup)
2. Nunchaku FP8/NVFP4 quantization  (1.1-1.3x speedup)
3. Optimal step count                (no wasted steps)
4. VAE tiling for decode             (enables high-res)

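The speedups from items 1 and 2 multiply: FLUX.1 with Spectrum (3.5-4.79x) and FP8 (~1.1x) works out to roughly 3.9-5.3x total acceleration. Items 3 and 4 avoid wasted steps and memory limits rather than adding a further multiplier.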
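
As a quick check of the multiplication, the sketch below combines the FLUX.1 Spectrum range with the FP8 factor from the tables in this document. It assumes the two speedups are independent, which may not hold exactly in practice.

```python
import math

def stacked_speedup(*factors):
    """Combined speedup when independent factors multiply."""
    return math.prod(factors)

low = stacked_speedup(3.5, 1.1)    # Spectrum lower bound, FP8
high = stacked_speedup(4.79, 1.1)  # Spectrum upper bound, FP8
print(f"{low:.2f}x to {high:.2f}x")  # 3.85x to 5.27x
```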

Gotchas

Spectrum + LoRA interaction: spectral forecasting assumes stable frequency patterns. Heavy LoRA influence can shift which frequencies change per step, potentially requiring threshold tuning. Test with your specific LoRA before production use.
Quantization + fine detail: NVFP4 quantization slightly softens fine textures (hair strands, fabric weave). For product photography where texture detail is critical, FP8 is the safe choice over NVFP4.
Distilled model CFG: distilled models have guidance baked in. Setting CFG above 1.0-1.5 produces "deep-fried" artifacts. Klein 9B distilled requires CFG=1.0 specifically.
TurboQuant on short contexts: at 4K tokens or fewer, the rotation/dequantization overhead can outweigh the memory savings, which are small at that length. It pays off mainly at 32K+ contexts, with the largest gains at 128K+.
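
On the distilled-model CFG note above: the standard classifier-free guidance formula shows why CFG=1.0 is the neutral setting. A minimal sketch with made-up prediction values; at scale 1.0 the result equals the conditional prediction, and larger scales extrapolate past it.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.10, -0.20, 0.30]  # hypothetical unconditional prediction
cond = [0.40, 0.10, -0.50]    # hypothetical conditional prediction

at_one = cfg_combine(uncond, cond, 1.0)   # reduces to the conditional prediction
at_four = cfg_combine(uncond, cond, 4.0)  # pushed well beyond it

print([round(x, 2) for x in at_one])   # [0.4, 0.1, -0.5]
print([round(x, 2) for x in at_four])  # [1.3, 1.0, -2.9]
```

For a model with guidance already baked in, raising the scale amplifies an effect that is already present, which is consistent with the over-saturated artifacts described above.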