
FLUX.2 Klein 9B Architecture

Intermediate

Deep reference for the FLUX.2 Klein 9B model internals: transformer structure, text encoding, VAE, and editing mechanism. Covers differences from FLUX.1 and implications for LoRA training.

Model Variants

| Variant | Params | Text Encoder | License | Use Case |
|---|---|---|---|---|
| Klein Base 9B | 9B | Qwen3 8B | Non-Commercial | Training LoRAs, editing |
| Klein 4B | 4B | Qwen3 4B | Apache 2.0 | Commercial inference |
| Klein KV | 9B | Qwen3 8B | Non-Commercial | Reduced KV cache variant |
| Klein FP8 | 9B (quantized) | Qwen3 8B | Non-Commercial | Low-VRAM training |
| Klein nvfp4 | 9B (quantized) | Qwen3 8B | Non-Commercial | NVIDIA Blackwell inference |
| Klein Distilled | 9B | Qwen3 8B | Non-Commercial | Fast inference (4 steps) |

Critical incompatibility: 4B LoRAs cannot load on 9B and vice versa. The two models have different hidden dimensions (the 4B model's is smaller) and different text encoder output shapes.

Transformer Block Structure

Klein 9B uses an MMDiT-style architecture with two block types:

Input (noisy latent + reference images)
8 × Double Blocks       ← process image+text jointly
24 × Single Blocks      ← image-only refinement
Output (denoised latent)

| Component | Value | Notes |
|---|---|---|
| hidden_dim | 4096 | vs FLUX.1's 3072 |
| double_blocks | 8 | joint image+text attention |
| single_blocks | 24 | image-only attention |
| rope_theta | 2000 | 3D RoPE for spatial+temporal |
| max_seq_len | 52000 | handles long reference sequences |

FLUX.1 vs FLUX.2 comparison:

| Aspect | FLUX.1 | FLUX.2 Klein |
|---|---|---|
| hidden_dim | 3072 | 4096 |
| double_blocks | varies by variant | 8 |
| single_blocks | varies | 24 |
| Text encoder | T5-XXL (4B) | Qwen3 8B |
| Edit mechanism | Separate Fill model | Unified (concat in sequence) |
| LoRA compatibility | Not compatible with FLUX.2 Klein | Not compatible with FLUX.1 |

Text Encoding (Qwen3 8B)

Klein uses Qwen3 8B as its sole text encoder (replacing T5 + CLIP in older FLUX):

Qwen3 8B (36 transformer layers)
Text → Hidden states at layers 9, 18, 27 → Concatenated
Concatenated features → Projected → Joint attention in double blocks

| Parameter | Value |
|---|---|
| Encoder layers | 36 |
| Features extracted | Layers 9, 18, 27 (early/mid/late) |
| Hidden size | 4096 |
| Vocab | Qwen3 tokenizer |

Extracting from several depths gives the model coarse semantics (early layers) and refined semantics (late layers) at the same time, rather than only the final-layer representation.
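The extraction step can be sketched with plain Python lists standing in for tensors. This is a minimal illustration, not the model's code: `extract_features` is a hypothetical helper, the toy hidden size is 4 instead of 4096, and it assumes the three layer outputs are concatenated along the feature axis (the source says only "concatenated").

```python
TAP_LAYERS = (9, 18, 27)  # layer indices from the table above


def extract_features(hidden_states, tap_layers=TAP_LAYERS):
    """Concatenate each token's features from the tapped layers.

    hidden_states[i] holds layer i's output as [num_tokens][hidden_size].
    ASSUMPTION: index 0 is the embedding output, so index i is layer i.
    """
    num_tokens = len(hidden_states[tap_layers[0]])
    features = []
    for tok in range(num_tokens):
        row = []
        for layer in tap_layers:
            row.extend(hidden_states[layer][tok])
        features.append(row)
    return features


# Toy data: 36 layers plus the embedding slot, 5 tokens, hidden size 4.
# Every value equals its layer index so the result is easy to inspect.
toy_states = [[[float(layer)] * 4 for _ in range(5)] for layer in range(37)]
features = extract_features(toy_states)
print(len(features), len(features[0]))  # 5 12
```

Under the same assumption, the real hidden size of 4096 would give 3 × 4096 = 12288 features per token before projection.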

VAE

Image (3 channels, H×W)
↓ encode
Latent (32 channels, H/8 × W/8)
↓ patch packing (2×2 = 4× compression)
Token sequence (128 channels per patch)

| Aspect | Value |
|---|---|
| Latent channels | 32 |
| Spatial compression | 8× per side |
| Patch packing | 2×2 → 128 channels |
| Total compression | 16× per side (8× VAE × 2× patch), i.e. 256× fewer positions than pixels |

The 32-channel VAE (vs 4-channel in SDXL) enables richer latent representation but requires significantly more VRAM.
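The shape arithmetic in the diagram can be checked with a few lines of Python. This is a sketch built only from the numbers above (8× spatial compression, 2×2 packing, 32 latent channels); `packed_shape` is a hypothetical helper name.

```python
VAE_DOWNSCALE = 8     # latent grid is H/8 x W/8
PATCH = 2             # 2x2 patch packing
LATENT_CHANNELS = 32


def packed_shape(height, width):
    """Return (num_tokens, channels_per_token) for an image in pixels."""
    step = VAE_DOWNSCALE * PATCH  # 16 pixels per token side
    if height % step or width % step:
        raise ValueError(f"height and width must be multiples of {step}")
    tokens = (height // step) * (width // step)
    channels = LATENT_CHANNELS * PATCH * PATCH
    return tokens, channels


print(packed_shape(1024, 1024))  # (4096, 128)
```

A 1024×1024 image therefore contributes 4096 tokens of 128 channels each to the transformer sequence.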

Edit Mechanism (3D RoPE Time Offsets)

Klein's core editing innovation: reference images share the same sequence space as the output, differentiated only by temporal position encoding:

Sequence: [ref_image_1 | ref_image_2 | ... | output_tokens]
Time (t):      1/2           1/4              0
# Simplified: reference tokens at t=0.5 (halfway through diffusion)
# Output tokens at t=0 (start of denoising)
# 3D RoPE encodes: (x_position, y_position, time)

rope_embed = RoPE3D(x=pos_x, y=pos_y, t=time_offset)
# ref_img1: t=0.5
# ref_img2: t=0.25 (if 2nd reference)
# output:   t=0.0
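
To make the layout concrete, the sketch below builds the per-token (t, y, x) coordinates that a 3D RoPE would consume. It does not compute the rotary embedding itself. `build_position_ids` is a hypothetical helper, and the time offsets are the illustrative values used above.

```python
def grid_ids(t, rows, cols):
    """(t, y, x) coordinates for one image's token grid, row-major."""
    return [(t, y, x) for y in range(rows) for x in range(cols)]


def build_position_ids(ref_grids, out_grid, ref_times=(0.5, 0.25)):
    """Lay out [ref_1 | ref_2 | ... | output] as one coordinate sequence.

    ref_grids: list of (rows, cols) token grids, one per reference image.
    out_grid:  (rows, cols) token grid of the output image.
    ASSUMPTION: ref_times follows the simplified offsets shown above.
    """
    if len(ref_grids) > len(ref_times):
        raise ValueError("not enough time offsets for the references")
    ids = []
    for (rows, cols), t in zip(ref_grids, ref_times):
        ids.extend(grid_ids(t, rows, cols))
    ids.extend(grid_ids(0.0, *out_grid))  # output tokens sit at t=0
    return ids


ids = build_position_ids(ref_grids=[(2, 2), (2, 2)], out_grid=(2, 2))
print(ids[0], ids[4], ids[8])  # (0.5, 0, 0) (0.25, 0, 0) (0.0, 0, 0)
```

Tokens at the same (y, x) in a reference and in the output differ only in t, which is the sole signal separating the two.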

Why this matters for training:

  • Training cost scales with reference count (each ref ~doubles sequence length)
  • 1 reference: ~2.11× base training cost
  • Multiple references: sequence fills max_seq_len=52000 quickly
  • zero_cond_t: parameter for ablating reference conditioning during training

Training Cost Multipliers

| References | Approx. Cost Multiplier |
|---|---|
| 0 (text-only) | 1.0× |
| 1 | ~2.11× |
| 2 | ~3.5× |
| 4+ | Approaches max_seq_len limit |
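
Before launching a run, it can help to estimate the sequence length for a given number of references. The sketch below counts tokens only and makes no claim about cost. It assumes 16 pixels per token side (8× VAE × 2×2 packing), that text tokens share the sequence, and a placeholder text length of 512; the helper names are hypothetical.

```python
MAX_SEQ_LEN = 52000  # from the component table


def image_tokens(height, width, step=16):
    """Tokens contributed by one image (assumes 16 px per token side)."""
    return (height // step) * (width // step)


def sequence_length(out_size, ref_sizes, text_tokens=512):
    """Total tokens for text + output + references.

    ASSUMPTION: text_tokens=512 is a placeholder, not a documented value.
    """
    total = text_tokens + image_tokens(*out_size)
    for size in ref_sizes:
        total += image_tokens(*size)
    return total


def fits(out_size, ref_sizes, text_tokens=512):
    """True if the estimated sequence stays within MAX_SEQ_LEN."""
    return sequence_length(out_size, ref_sizes, text_tokens) <= MAX_SEQ_LEN


one_ref = sequence_length((1024, 1024), [(1024, 1024)], text_tokens=0)
print(one_ref)                                   # 8192
print(fits((2048, 2048), [(2048, 2048)] * 4))    # False
```

With no text, one same-size reference exactly doubles the image token count, which matches the "each ref ~doubles sequence length" rule of thumb.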

LoRA Training Implications

Where to target for different LoRA types:

| LoRA Goal | Target Blocks | Rationale |
|---|---|---|
| Style only | double_transformer_blocks | Joint image+text attention = style |
| Structure/identity | single_transformer_blocks 0-12 | Early image-only = spatial structure |
| Full-quality | Both block types, all layers | 128/64/64/32 universal recipe |
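
One way to apply this table is to filter module names by block prefix before attaching adapters. The sketch below uses only the standard library. The dotted module names (for example `single_transformer_blocks.3.attn.to_q`) are hypothetical, and "0-12" is read as inclusive; check both against the trainer you actually use.

```python
import re

# Regexes keyed by LoRA goal; block prefixes come from the table above.
TARGET_PATTERNS = {
    "style": r"^double_transformer_blocks\.\d+\.",
    "structure": r"^single_transformer_blocks\.(?:[0-9]|1[0-2])\.",
    "full": r"^(?:double|single)_transformer_blocks\.\d+\.",
}


def select_targets(module_names, goal):
    """Return the module names that a LoRA for `goal` should wrap."""
    pattern = re.compile(TARGET_PATTERNS[goal])
    return [name for name in module_names if pattern.match(name)]


# Hypothetical module names, for illustration only.
names = [
    "double_transformer_blocks.0.attn.to_q",
    "single_transformer_blocks.3.attn.to_q",
    "single_transformer_blocks.12.attn.to_q",
    "single_transformer_blocks.13.attn.to_q",
    "x_embedder",
]
print(select_targets(names, "structure"))
```

Anchoring each pattern at the start of the name and requiring the trailing dot keeps block 1 from also matching blocks 10 through 19.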

Why FLUX.1 adapters don't work on Klein:

  • hidden_dim 3072 (FLUX.1) vs 4096 (Klein): all weight shapes mismatch
  • Different text encoder: T5 vs Qwen3
  • Different attention heads and projection sizes
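
A shape check before loading makes the mismatch explicit instead of relying on the loader to complain. The sketch below assumes the common LoRA layout in which a `lora_A.weight` tensor has shape (rank, in_features), and that every targeted module takes a hidden_dim-wide input; both are assumptions, and the key names are hypothetical.

```python
def incompatible_tensors(lora_shapes, hidden_dim):
    """Return names of LoRA down-projections whose input width is wrong.

    lora_shapes: mapping of tensor name -> shape tuple.
    ASSUMPTION: '*.lora_A.weight' is (rank, in_features) and in_features
    should equal the base model's hidden_dim for every targeted module.
    """
    bad = []
    for name, shape in lora_shapes.items():
        if name.endswith("lora_A.weight") and shape[1] != hidden_dim:
            bad.append(name)
    return sorted(bad)


# Hypothetical FLUX.1-style adapter with hidden_dim 3072.
flux1_lora = {
    "blocks.0.attn.to_q.lora_A.weight": (16, 3072),
    "blocks.0.attn.to_q.lora_B.weight": (3072, 16),
}
print(incompatible_tensors(flux1_lora, hidden_dim=4096))
print(incompatible_tensors(flux1_lora, hidden_dim=3072))  # []
```

Raising an error when the returned list is non-empty turns a potential silent failure into an immediate, readable one.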

Distilled vs Base

| Aspect | Base | Distilled |
|---|---|---|
| Steps | 20-24 | 4 |
| CFG | 3.5-5.0 | 1.0 (CFG-free) |
| LoRA training | Required | Not recommended |
| Edit conditioning | Full | Simplified |
| Quality ceiling | Higher | Faster, lower ceiling |

Training LoRAs on distilled models: paired edit training breaks because the distilled, CFG-free inference path differs from the training objective.
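
The "CFG 1.0 = CFG-free" entry follows from the usual classifier-free guidance combination. The sketch below assumes that standard formula (the source does not state which guidance formulation Klein uses) and operates on plain lists rather than tensors.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]      # toy conditional prediction
uncond = [0.5, -1.0]   # toy unconditional prediction

print(cfg_combine(cond, uncond, scale=1.0))  # [1.0, 2.0]
print(cfg_combine(cond, uncond, scale=4.0))  # [2.5, 11.0]
```

At scale 1.0 the result equals the conditional prediction, so the unconditional pass can be skipped entirely. At the base model's 3.5-5.0 range both passes are needed on every step.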

Gotchas

  • FLUX.1 LoRAs are NOT compatible: weight shapes differ (3072 vs 4096 hidden_dim). Loading a FLUX.1 LoRA on Klein will silently fail or produce corrupted output.
  • 9B base required for training: Klein-distilled lacks the full conditioning path needed for edit LoRA or character LoRA training. Always train on flux-2-klein-base-9b.safetensors.
  • 32-channel VAE VRAM cost: the richer VAE requires significantly more VRAM for encode/decode vs SDXL. Use --cache_latents to precompute and remove VAE from training GPU.
  • max_seq_len=52000 limit: adding many reference images quickly exhausts the sequence limit, causing OOM. With 4+ references at 1024px, sequences can exceed 50K tokens and hit the limit.

See Also