
Diffusion LoRA Training

Expert

Practical patterns for LoRA fine-tuning of diffusion models (FLUX Klein 9B, SANA, SDXL). Covers dataset preparation, training tools, hyperparameters, and multi-trainer comparison.

Dataset Preparation

Size Guidelines

| Task | Minimum (images) | Recommended (images) | Max Useful (images) |
|---|---|---|---|
| Single subject (DreamBooth) | 3-5 | 5-10 | 15 |
| Style (photography style) | 15-20 | 25-30 | 50 |
| Domain (product category) | 30-50 | 50-100 | 200 |
| Complex domain + variations | 50+ | 100-200 | 500 |

Caption Quality

Detailed captions matter more than data quantity. Include:

  • Material, texture, lighting setup
  • Camera angle, focal length, depth of field
  • Background description, environment
  • Style attributes specific to the domain

Good: "sks jewelry photo, 18k gold engagement ring with oval cut diamond,
       soft studio lighting, dark velvet background, macro photography,
       sharp focus on gemstone facets"

Bad:  "ring on dark background"

Trigger Words

Use a rare token (e.g., "sks") as the trigger word for DreamBooth. Define one per LoRA style. The token must not collide with an existing vocabulary meaning.
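The caption and trigger-word guidance above can be turned into a quick pre-training check. This is a minimal sketch, assuming captions are plain strings and the trigger token is "sks". The helper name `check_caption` and the `min_words` threshold are hypothetical and not part of any trainer.

```python
def check_caption(caption: str, trigger: str = "sks", min_words: int = 8) -> list[str]:
    """Return a list of problems found in one caption (empty list means OK)."""
    problems = []
    # Treat commas as separators so "photo," and "photo" count the same.
    words = caption.replace(",", " ").split()
    if trigger not in words:
        problems.append(f"missing trigger word '{trigger}'")
    if len(words) < min_words:
        problems.append(f"too short ({len(words)} words, want >= {min_words})")
    return problems


good = ("sks jewelry photo, 18k gold engagement ring with oval cut diamond, "
        "soft studio lighting, dark velvet background")
bad = "ring on dark background"

print(check_caption(good))  # no problems reported
print(check_caption(bad))   # flags both the missing trigger and the short length
```

A word count is only a rough proxy for caption detail; it catches the obviously thin captions but says nothing about whether lighting or material is actually described.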

Image Requirements

  • Resolution: 1024x1024 minimum (match training resolution)
  • Format: PNG preferred (lossless), JPG acceptable if high quality
  • Variety: mix angles, lighting, compositions within the domain
  • Quality: curated > quantity. Remove blurry, poorly lit, atypical examples.
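The resolution and format requirements can be screened automatically before manual curation. The sketch below assumes the width and height have already been read (for example with Pillow's `Image.open(path).size`), and it interprets "1024x1024 minimum" as a limit on the shorter side; both the helper name `audit_image` and that interpretation are assumptions.

```python
from pathlib import PurePath

ACCEPTED_SUFFIXES = {".png", ".jpg", ".jpeg"}


def audit_image(name: str, width: int, height: int, min_side: int = 1024) -> list[str]:
    """Return a list of issues for one image (empty list means it passes)."""
    issues = []
    if PurePath(name).suffix.lower() not in ACCEPTED_SUFFIXES:
        issues.append("unsupported format")
    shorter = min(width, height)
    if shorter < min_side:
        issues.append(f"shorter side {shorter}px < {min_side}px")
    return issues


print(audit_image("ring_01.png", 2048, 1536))   # passes both checks
print(audit_image("ring_02.webp", 800, 1200))   # fails format and resolution
```

Blur, lighting, and atypical compositions still need a human pass; this only removes files that cannot meet the stated minimums.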

FLUX Klein 9B LoRA Training

Critical Rules (From 50+ Training Runs)

| Parameter | Value | Notes |
|---|---|---|
| Network dims | 128/64/64/32 | linear/linear_alpha/conv/conv_alpha (4:2:2:1 ratio) |
| Weight decay | 0.00001 | 1/10th default for balanced analog texture |
| Learning rate | DO NOT CHANGE | Even 0.005% change destroys image quality on Flux architecture |
| Optimal steps | ~7,000 | 3K = too raw, beyond 7K = anatomical distortion |
| Trigger word | One per style | Required for activation |

Klein-Specific Notes

  • Train on base model (klein-base-9b), NOT distilled
  • 9B uses qwen_3_8b text encoder; 4B uses qwen_3_4b
  • 4B LoRA NOT compatible with 9B and vice versa
  • FP8 transformer saves significant VRAM during 9B training

SANA LoRA Training

accelerate launch train_dreambooth_lora_sana.py \
  --pretrained_model_name_or_path=Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers \
  --instance_data_dir=data/dreambooth/jewelry \
  --output_dir=trained-sana-lora \
  --mixed_precision=bf16 \
  --instance_prompt="a photo of sks jewelry" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --lr_scheduler=constant \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks jewelry on black velvet" \
  --validation_epochs=25 \
  --seed=0 \
  --cache_latents \
  --offload

SANA Training Parameters

| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 1e-4 | Standard for SANA LoRA |
| Max steps | 500 | Fast convergence (vs 7K for FLUX) |
| Resolution | 1024 | Native SANA resolution |
| Effective batch | 4 | batch 1 x grad_accum 4 |
| Precision | bf16 | Required |
| Optimizer | 8-bit AdamW | Memory efficient |
| LR scheduler | constant | No warmup needed |
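The effective batch size and the total number of samples the run sees follow directly from the command-line values. A small sketch, assuming `--max_train_steps` counts optimizer updates (after gradient accumulation) and a single GPU; the 10-image dataset size is a hypothetical example.

```python
def effective_batch(train_batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return train_batch_size * grad_accum_steps * num_gpus


eff = effective_batch(train_batch_size=1, grad_accum_steps=4)
samples_seen = eff * 500          # max_train_steps = 500
passes_over_data = samples_seen / 10  # hypothetical dataset of 10 images

print(eff)               # 4
print(samples_seen)      # 2000
print(passes_over_data)  # 200.0
```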

LoRA Configuration

Target layers: attn.to_k, attn.to_q, attn.to_v for attention-only LoRA.

LoRA scaling = alpha / rank:

  • alpha = rank: scaling factor 1.0 (standard)
  • alpha < rank: subtler modifications
  • alpha > rank: amplified effect
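To make the alpha / rank relationship concrete, the sketch below merges a LoRA update into a weight matrix using plain nested lists. It illustrates the standard LoRA formulation (W' = W + (alpha / rank) * up @ down); the function names and the tiny matrices are illustrative only, not taken from any trainer.

```python
def lora_scale(alpha: float, rank: int) -> float:
    return alpha / rank


def apply_lora(weight, lora_down, lora_up, alpha):
    """Return weight + (alpha / rank) * (lora_up @ lora_down).

    weight:    out_features x in_features
    lora_down: rank x in_features
    lora_up:   out_features x rank
    """
    rank = len(lora_down)
    scale = lora_scale(alpha, rank)
    merged = []
    for i in range(len(weight)):
        row = []
        for j in range(len(weight[0])):
            delta = sum(lora_up[i][r] * lora_down[r][j] for r in range(rank))
            row.append(weight[i][j] + scale * delta)
        merged.append(row)
    return merged


print(lora_scale(32, 32))   # 1.0, alpha = rank
print(lora_scale(64, 128))  # 0.5, e.g. linear rank 128 with linear_alpha 64

weight = [[1, 0], [0, 1]]
down = [[1, 2]]    # rank 1
up = [[1], [0]]
print(apply_lora(weight, down, up, alpha=0.5))  # [[1.5, 1.0], [0.0, 1.0]]
```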

Memory Optimizations

  • --offload: CPU offload text encoder + VAE when not in use
  • --cache_latents: precompute VAE latents, remove VAE from GPU
  • --use_8bit_adam: bitsandbytes 8-bit optimizer

Two-Stage Domain Training

Stage 1: Domain LoRA - teaches what the domain looks like:

  • 20-50 high-quality domain images with detailed captions
  • Standard T2I DreamBooth/LoRA training
  • Learns materials, lighting, textures, compositions

Stage 2: Task-specific - teaches what to DO:

  • Build on domain LoRA or merge it
  • Paired data for editing tasks (before/after)
  • InstructPix2Pix-style training for edit LoRAs

For txt2img domain generation, Stage 1 alone is sufficient.

Training Tool Comparison

| Feature | diffusion-pipe | ai-toolkit | SimpleTuner | kohya_ss |
|---|---|---|---|---|
| Pipeline parallelism | DeepSpeed | No | No | No |
| Multi-GPU | Native hybrid | Limited | Data parallel | Data parallel |
| LoRA format (Klein) | ComfyUI native | Diffusers | Diffusers | Diffusers |
| Progressive resolution | No | No | Yes | No |
| Text encoder LoRA | No (pre-cache) | Yes | Yes | Yes |
| Masked training | Yes | No | No | No |
| LR scheduling | warmup only | Full set | Full set | Full set |
| Prodigy optimizer | Not documented | Yes | Yes | Yes |
| Resume training | Pain-free | Yes | Yes | Yes |
| Windows | WSL2 only | Native | Native | Native |
| Config format | TOML | YAML | JSON + env | TOML |

diffusion-pipe Unique Features

Masked training: provide binary mask per image - white regions train, black regions are masked from loss. Ideal for face-only training (train on face, ignore background/clothing).

Pipeline parallelism: Klein 9B splits across multiple GPUs via DeepSpeed. With pipeline_stages=2, model divides across 2 GPUs.

ComfyUI-native LoRA format: no conversion needed for Klein inference in ComfyUI.

diffusion-pipe Klein Config

[model]
type = 'flux2'
diffusion_model = '/path/to/flux-2-klein-base-9b.safetensors'
vae = '/path/to/flux2-vae.safetensors'
text_encoders = [
  {path = '/path/to/qwen_3_8b.safetensors', type = 'flux2'}
]
dtype = 'bfloat16'
diffusion_model_dtype = 'float8'
timestep_sample_method = 'logit_normal'
shift = 3

[adapter]
type = 'lora'
rank = 32
dtype = 'bfloat16'

[optimizer]
type = 'AdamW8bitKahan'
lr = 5e-5
betas = [0.9, 0.99]
weight_decay = 0.01

Dataset Config (diffusion-pipe)

resolutions = [1024]
enable_ar_bucket = true
min_ar = 0.5
max_ar = 2.0
num_ar_buckets = 7
num_repeats = 1
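To see what these aspect-ratio settings imply, the sketch below spaces 7 buckets geometrically between `min_ar` and `max_ar` and derives a width/height pair of roughly constant area for each. Geometric spacing, the area-preserving sizing, and rounding to a multiple of 16 are all assumptions made for illustration; the trainer's actual bucketing may differ.

```python
def ar_buckets(min_ar: float, max_ar: float, num: int) -> list[float]:
    """Aspect ratios (width / height) spaced evenly on a log scale."""
    ratio = max_ar / min_ar
    return [min_ar * ratio ** (i / (num - 1)) for i in range(num)]


def bucket_size(ar: float, resolution: int = 1024, multiple: int = 16) -> tuple[int, int]:
    """Width/height with area close to resolution**2, rounded to `multiple`."""
    area = resolution * resolution
    width = (area * ar) ** 0.5
    height = width / ar
    return (round(width / multiple) * multiple, round(height / multiple) * multiple)


buckets = ar_buckets(0.5, 2.0, 7)
for ar in buckets:
    print(f"{ar:.3f} -> {bucket_size(ar)}")
```

With an odd bucket count the middle bucket lands on the square 1.0 ratio, which is one reason 7 is a convenient choice for a 0.5 to 2.0 range.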

Klein 9B Character / Identity LoRA

Training a LoRA to preserve a specific person's identity requires a different approach from style LoRA.

Critical: Optimizer Choice

adafactor FAILS on Klein 9B for character training. Use adamw8bit.

# diffusion-pipe (TOML)
[optimizer]
type = 'AdamW8bitKahan'
lr = 1e-4
betas = [0.9, 0.999]
weight_decay = 0.01

# ai-toolkit / SimpleTuner equivalent (YAML)
optimizer: adamw8bit
lr: 1e-4

Minimal Dataset Recipe (RunComfy standard)

| Parameter | Value | Notes |
|---|---|---|
| Images | 8-12 | Curated, varied angles + lighting |
| Steps | 2K-4K | Character identity; style = 7K |
| Repeats | 90-120 | Compensates for small dataset |
| Rank | 16-32 | RunComfy conservative; Herbst uses 128/64 |
| LR | 1e-4 | Same sensitivity warning as style LoRA |
| Optimizer | adamw8bit | adafactor will fail |
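The relationship between images, repeats, and steps in this recipe is simple arithmetic. A sketch, assuming batch size 1 with no gradient accumulation and that one "epoch" means one pass over images × repeats; the helper names are hypothetical.

```python
import math


def steps_per_epoch(num_images: int, repeats: int, batch_size: int = 1, grad_accum: int = 1) -> int:
    """Optimizer steps needed to see every repeated image once."""
    return math.ceil(num_images * repeats / (batch_size * grad_accum))


def epochs_for_steps(total_steps: int, num_images: int, repeats: int,
                     batch_size: int = 1, grad_accum: int = 1) -> float:
    return total_steps / steps_per_epoch(num_images, repeats, batch_size, grad_accum)


print(steps_per_epoch(10, 100))         # 1000
print(epochs_for_steps(3000, 10, 100))  # 3.0
```

So 10 images at 100 repeats reach the 2K-4K step range in two to four epochs, and each original image is seen 200 to 400 times.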

Alternative recipe (Herbst, 50+ runs): rank 128/64/64/32, weight_decay 0.00001, 7K steps. Reported as the universal winner for both style and character quality.

Caption Format for Identity LoRAs

trigger_token, [scene description], [lighting], [pose]

Key rules:

  • Put the trigger word FIRST.
  • Describe the scene, background, lighting, and pose.
  • Do NOT describe face features (eye color, nose shape, face shape). Leaving them uncaptioned forces those features INTO the LoRA weights; captioning them ties them to text tokens instead.
  • The model absorbs what is NOT described: face features must stay in the LoRA, not float free in text space.

Good: "john_doe portrait, standing in office, warm overhead lighting, three-quarter view"
Bad:  "john_doe, blue eyes, sharp jawline, dark hair, studio lighting"  # locks features to text

Proportion Issues Fix (Big Heads Problem)

Cause: dataset dominated by headshots (90%) + distillation artifacts.

Fix: 1/3 dataset rule

33% headshots (shoulders up)
33% half-body
33% full-body
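Whether a dataset follows the 1/3 rule can be checked from per-image framing labels. A minimal sketch, assuming each image has been tagged manually as "headshot", "half_body", or "full_body"; the label names, the `framing_report` helper, and the 10-point tolerance are hypothetical choices.

```python
from collections import Counter

TARGET_SHARE = {"headshot": 1 / 3, "half_body": 1 / 3, "full_body": 1 / 3}


def framing_report(labels: list[str], tolerance: float = 0.10) -> dict[str, tuple[float, bool]]:
    """Map each framing type to (share of dataset, within tolerance of 1/3)."""
    if not labels:
        raise ValueError("no labels given")
    counts = Counter(labels)
    report = {}
    for kind, target in TARGET_SHARE.items():
        share = counts.get(kind, 0) / len(labels)
        report[kind] = (round(share, 2), abs(share - target) <= tolerance)
    return report


headshot_heavy = ["headshot"] * 9 + ["full_body"]
balanced = ["headshot"] * 4 + ["half_body"] * 4 + ["full_body"] * 4

print(framing_report(headshot_heavy))
print(framing_report(balanced))
```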

Caption composition explicitly:

"john_doe, full body shot, standing, arms visible, white studio backdrop"
"john_doe, half body, casual seated pose, table in foreground"

PuLID-Flux2 for Inference Enhancement

Apply after training to boost identity consistency without retraining:

  • InsightFace + EVA-CLIP → identity tokens injected into Klein double blocks
  • Supports up to 8 reference images (multi-reference via 3D RoPE time offsets)
  • Does not require paired training data

Auxiliary Losses for LoRA Training

Additional loss terms beyond standard diffusion loss for specialized training goals.

ArcFace Identity Loss

Preserves facial identity during edit LoRA training:

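import torch
import torch.nn.functional as F
from insightface.app import FaceAnalysis

face_app = FaceAnalysis(providers=['CUDAExecutionProvider'])
face_app.prepare(ctx_id=0)

def arcface_loss(generated, target, weight=0.1):
    # generated / target: BGR uint8 numpy images, the input format insightface expects
    gen_faces = face_app.get(generated)
    target_faces = face_app.get(target)
    if not gen_faces or not target_faces:
        return torch.tensor(0.0)  # no face detected in one of the images
    gen_embedding = torch.from_numpy(gen_faces[0].embedding)
    target_embedding = torch.from_numpy(target_faces[0].embedding)
    cos_sim = F.cosine_similarity(gen_embedding, target_embedding, dim=0)
    # insightface runs through ONNX Runtime, so no gradient flows back through this value
    return weight * (1 - cos_sim)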

Use when: edit LoRA must preserve identity across transformations (relight, restyle).

DreamBooth Prior Preservation

Prevents language drift (model forgets general concept while learning specific instance):

# In ai-toolkit
prior_preservation: true
prior_preservation_weight: 1.0
class_prompt: "a person"  # generic class prompt
num_class_images: 200

Prior images are generated by the base model before training begins. Loss = main loss + prior_weight × class_loss, where class_loss is the same diffusion loss computed on the class (prior) images.
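The combined loss can be written out in a few lines. This sketch uses plain Python lists and mean squared error as a stand-in for the diffusion loss; in a real trainer the predictions and targets would be tensors, and the function names here are hypothetical.

```python
def mse(pred: list[float], target: list[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target, class_pred, class_target,
                    prior_weight: float = 1.0) -> float:
    """main loss on instance images + prior_weight * loss on class (prior) images."""
    main_loss = mse(instance_pred, instance_target)
    class_loss = mse(class_pred, class_target)
    return main_loss + prior_weight * class_loss


loss_full = dreambooth_loss([1.0, 2.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0], prior_weight=1.0)
loss_half = dreambooth_loss([1.0, 2.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0], prior_weight=0.5)
print(loss_full)  # 1.5  (0.5 main + 1.0 * 1.0 prior)
print(loss_half)  # 1.0  (0.5 main + 0.5 * 1.0 prior)
```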

Masked Loss (diffusion-pipe)

Train only on specific regions — essential for face LoRA where background should not influence weights:

# diffusion-pipe dataset config
[[dataset.image_config]]
path = "data/faces/"
masks_path = "data/masks/"  # white = train region, black = ignore
masked_loss = true

Mask generation: face segmentation (BiSeNet, SAM2) → erode 10px → apply as binary mask.
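The effect of a mask on the loss can be sketched as follows: per-element squared error is multiplied by the mask and averaged over the kept elements only. Normalizing by the mask sum is an assumption made for this illustration, and the trainer may weight things differently; inputs are flat lists standing in for image tensors.

```python
def masked_mse(pred: list[float], target: list[float], mask: list[float]) -> float:
    """MSE over positions where mask is 1.0 (white); 0.0 (black) positions are ignored."""
    weighted = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    kept = sum(mask)
    return weighted / kept if kept > 0 else 0.0


pred = [1.0, 1.0, 1.0, 1.0]
target = [0.0, 0.0, 1.0, 1.0]

print(masked_mse(pred, target, [1.0, 1.0, 1.0, 1.0]))  # 0.5, every position counts
print(masked_mse(pred, target, [1.0, 0.0, 0.0, 0.0]))  # 1.0, only the first position counts
print(masked_mse(pred, target, [0.0, 0.0, 1.0, 1.0]))  # 0.0, errors sit in the ignored region
```

The last case is the point of masked training: errors in the background contribute nothing to the update.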

B-LoRA: Content vs Style Disentanglement

Target specific transformer block groups to separate content (structure) from style:

# B-LoRA: content preservation blocks
content_blocks = [f"transformer.single_transformer_blocks.{i}" for i in range(0, 12)]
# style-only blocks
style_blocks = [f"transformer.double_transformer_blocks.{i}" for i in range(0, 8)]

Training content blocks = structural/identity LoRA. Training style blocks = style-only LoRA.
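Selecting which parameters to train from a list of block prefixes is mostly string matching, with one pitfall: a bare prefix ending in `.1` also matches `.10`, `.11`, and so on. The sketch below appends a dot before matching. The parameter names are illustrative examples in the style of the block names above, not a dump from a real model.

```python
def select_params(param_names: list[str], block_prefixes: list[str]) -> list[str]:
    """Return the parameter names that belong to one of the given blocks."""
    # Trailing "." stops "...blocks.1" from also matching "...blocks.10".
    prefixes = tuple(p + "." for p in block_prefixes)
    return [name for name in param_names if name.startswith(prefixes)]


names = [
    "transformer.single_transformer_blocks.1.attn.to_q.weight",
    "transformer.single_transformer_blocks.10.attn.to_q.weight",
    "transformer.double_transformer_blocks.3.attn.to_k.weight",
]

print(select_params(names, ["transformer.single_transformer_blocks.1"]))
print(select_params(names, ["transformer.double_transformer_blocks.3"]))
```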

Concept Slider LoRA

Train on contrast pairs (attribute A vs. attribute B) to create a directional slider:

# Dataset: paired images, one with attribute, one without
# positive_prompts = ["sharp details", ...]
# negative_prompts = ["soft details", ...]
# Model learns the direction in latent space
# Inference: LoRA weight -2 to +2 controls attribute strength

LR Scheduling for Diffusion LoRA

The learning rate VALUE matters far more than the schedule shape for FLUX/Klein.

Schedule Options

| Schedule | Use Case | Notes |
|---|---|---|
| constant | Klein 9B (recommended) | Most controlled; Herbst default |
| constant_with_warmup | General purpose | 100-200 warmup steps |
| cosine | "Safe upgrade" from constant | Gradual decay, rarely hurts |
| Prodigy | When unsure of LR value | Self-tuning, sidesteps sensitivity |
| warmup_only | Not standard | Rarely used standalone |

Community consensus (Calvin Herbst, 50+ Klein runs): constant schedule, because Klein's LR sensitivity makes any decay risky. Better to nail the LR value than rely on schedule to compensate.

Chinese community consensus (Zhihu/Bilibili practitioners): For face/character LoRA, adamw8bit + 1e-4 + constant is the safe combination. Avoid 3e-4 + AdamW8bit - causes unrealistic facial shadows and wrinkles.
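For reference, the three fixed schedules discussed here differ only in how they map the step index to a learning rate. A minimal sketch of the usual definitions (linear warmup from zero, half-cosine decay to zero); the function `lr_at` is hypothetical and real trainers may differ in details such as a minimum LR.

```python
import math


def lr_at(step: int, base_lr: float, total_steps: int,
          schedule: str = "constant", warmup_steps: int = 0) -> float:
    """Learning rate at a given optimizer step."""
    if schedule == "constant":
        return base_lr
    if schedule == "constant_with_warmup":
        if step < warmup_steps:
            return base_lr * step / max(1, warmup_steps)  # linear ramp from 0
        return base_lr
    if schedule == "cosine":
        progress = min(step / max(1, total_steps), 1.0)
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


for name in ("constant", "constant_with_warmup", "cosine"):
    values = [lr_at(s, 1e-4, 7000, name, warmup_steps=100) for s in (0, 50, 3500, 7000)]
    print(name, [f"{v:.2e}" for v in values])
```

With cosine decay the model trains at half the nominal LR by the midpoint, which is the kind of drift the constant-schedule recommendation is meant to avoid.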

Steps vs Dataset Size (Chinese Community Rule)

| Dataset Size | Steps | Rationale |
|---|---|---|
| 10-20 images | 1K-2K | "images × 100" rule widely cited |
| 20-30 images | 2K-3K | Sweet spot for face LoRA |
| 30-50 images | 3K-4K | >50 images at 4K steps = overfitting risk |
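The "images × 100" rule reduces to a one-line helper. Capping the result at 4,000 steps is an assumption read off the top of the table, not a documented limit, and `suggested_steps` is a hypothetical name.

```python
def suggested_steps(num_images: int, steps_per_image: int = 100, cap: int = 4000) -> int:
    """Starting point for total steps under the 'images x 100' rule."""
    return min(num_images * steps_per_image, cap)


for n in (10, 25, 40, 60):
    print(n, suggested_steps(n))
```

Treat the result as a starting point for checkpoint evaluation, not a stopping rule.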

Critical finding: FLUX training curves are non-monotonic. Epoch 6 may look good, epoch 8 overfitted, epoch 10 normal again. Always save every checkpoint and evaluate empirically - don't trust the curve.
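Because quality does not improve steadily with training time, the practical consequence is to score every saved checkpoint and pick by score instead of taking the last one. A sketch, assuming some per-checkpoint evaluation score where higher is better (for example identity similarity on held-out prompts); the scores below are made-up illustrative numbers, not measurements.

```python
def pick_checkpoint(scores: dict[int, float]) -> int:
    """Return the epoch with the highest score; ties go to the earliest epoch."""
    return max(sorted(scores), key=lambda epoch: scores[epoch])


# Hypothetical evaluation scores keyed by epoch.
scores = {6: 0.82, 8: 0.61, 10: 0.79}

print(pick_checkpoint(scores))  # 6, even though epoch 10 is the final checkpoint
```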

Prodigy: Self-Tuning LR

[optimizer]
type = 'prodigy'
# No lr needed — Prodigy estimates it
d_coef = 1.0
weight_decay = 0.01
safeguard_warmup = true

Useful when starting a new architecture without established LR priors. May still overfit if steps are too many.

Prodigy gotcha: last few epochs can cause hand deformities. Save earlier checkpoints and test at epoch N-2 before accepting the final.

DiffSynth-Studio Official Klein 9B Config

All Klein variants (4B, 9B, 9B-Base) use the same base config in DiffSynth-Studio:

learning_rate: 1e-4
epochs: 5
lora_rank: 32
max_pixels: 1048576   # = 1024x1024
dataset_repeat: 50
gradient_checkpointing: true
lora_base_model: "dit"
target_modules:
  - to_q, to_k, to_v, to_out.0
  - add_q_proj, add_k_proj, add_v_proj, to_add_out
  - linear_in, linear_out, to_qkv_mlp_proj
  - single_transformer_blocks.*.attn.to_out

Overfitting Detection

| Symptom | Cause | Fix |
|---|---|---|
| Training images reproduced exactly | Too many steps / too few images | Reduce steps, add data diversity |
| Anatomical distortion | Training past optimal point | Stop at ~7K steps (FLUX) |
| Color/style collapse | LR too high | Reduce LR (carefully for FLUX) |
| Prompt ignored | Overfit to training captions | More diverse captions, lower rank |
| Artifacts at high LoRA strength | Training instability | Lower alpha, add weight decay |

Dependencies

# SANA LoRA (diffusers)
# Quote the specifiers: an unquoted ">=" is treated as a shell redirect
pip install "diffusers[training]" "peft>=0.14.0" accelerate bitsandbytes wandb
# From source for latest:
pip install git+https://github.com/huggingface/diffusers.git

# diffusion-pipe
pip install deepspeed  # heavy dependency

Gotchas

  • FLUX LR sensitivity: the FLUX architecture is extremely sensitive to learning rate changes. The documented "DO NOT CHANGE" warning comes from 50+ training runs showing even 0.005% deviation destroys output quality. This is architecture-specific - SANA and SDXL are far more forgiving.
  • Klein 4B/9B LoRA incompatibility: different text encoders (4B vs 8B Qwen) mean LoRAs trained on one model cannot be loaded on the other. Always verify model variant before training.
  • diffusion-pipe on Windows: requires WSL2. Direct Windows execution is not supported due to DeepSpeed dependency.
  • Cache latents for SANA: failing to use --cache_latents keeps the VAE on GPU throughout training, wasting 2-4 GB VRAM that could be used for larger batch size or higher rank.
  • adafactor + Klein = broken character LoRA: adafactor adaptive scaling does not converge properly with Klein 9B's architecture for identity training. Symptom: identity collapses to average face by 1K steps. Fix: switch to adamw8bit.
  • Face feature captions destroy identity LoRA: describing specific face features (eye color, face shape) in captions teaches the model to associate those features with text tokens rather than embedding them in LoRA weights. Result: LoRA has low identity fidelity. Always caption scene/context/pose, never face attributes.
  • Big heads from headshot-only datasets: Klein (and diffusion models generally) develop distorted proportions when trained on 90%+ headshots. Always mix headshot/half-body/full-body at roughly equal ratios.
  • Training on distilled Klein: edit LoRAs and character LoRAs must train on base model (klein-base-9b), not distilled variants. Distilled models have simplified conditioning that disrupts paired training.
  • Class prompt hurts FLUX character LoRA: unlike DreamBooth for SDXL where class prompt (e.g., "a photo of a person") helps prior preservation, using class prompts for FLUX/Klein character LoRA degrades results — especially for male subjects and pets. FLUX already has strong identity priors; class prompt fights against the trigger word learning.
  • No trigger word needed for face LoRA: Chinese community finding — trigger word has no detectable impact on high-precision face LoRA quality. FLUX handles person identity well without explicit token. Using a trigger word is optional, not required.
  • Regularization images greatly improve generalization: for face/character LoRAs, including regularization images (similar subjects without the identity) significantly improves out-of-distribution generalization. Highly recommended even for small datasets (10-30 images).

See Also