Diffusion LoRA Training

Advanced

Practical patterns for LoRA fine-tuning of diffusion models (FLUX Klein 9B, SANA, SDXL). Covers dataset preparation, training tools, hyperparameters, and multi-trainer comparison.

Dataset Preparation

Size Guidelines

| Task | Minimum images | Recommended images | Max useful images |
| --- | --- | --- | --- |
| Single subject (DreamBooth) | 3-5 | 5-10 | 15 |
| Style (photography style) | 15-20 | 25-30 | 50 |
| Domain (product category) | 30-50 | 50-100 | 200 |
| Complex domain + variations | 50+ | 100-200 | 500 |

Caption Quality

Detailed captions are more important than data quantity. Include:

  • Material, texture, lighting setup
  • Camera angle, focal length, depth of field
  • Background description, environment
  • Style attributes specific to the domain

Good: "sks jewelry photo, 18k gold engagement ring with oval cut diamond,
       soft studio lighting, dark velvet background, macro photography,
       sharp focus on gemstone facets"

Bad:  "ring on dark background"
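
One way to keep captions consistently detailed is to assemble them from structured fields rather than writing each one freehand. The sketch below is illustrative only: `CaptionFields` and `build_caption` are hypothetical names introduced here, not part of any trainer.

```python
from dataclasses import dataclass


@dataclass
class CaptionFields:
    """Hypothetical container for the attributes a detailed caption should cover."""
    trigger: str        # trigger phrase, e.g. "sks jewelry photo"
    subject: str        # object and material description
    lighting: str       # lighting setup
    background: str     # background / environment
    camera: str         # angle, lens, depth of field
    details: str = ""   # optional domain-specific style notes


def build_caption(fields: CaptionFields) -> str:
    """Join the non-empty fields into one comma-separated caption."""
    parts = [
        fields.trigger,
        fields.subject,
        fields.lighting,
        fields.background,
        fields.camera,
        fields.details,
    ]
    return ", ".join(p.strip() for p in parts if p.strip())


caption = build_caption(CaptionFields(
    trigger="sks jewelry photo",
    subject="18k gold engagement ring with oval cut diamond",
    lighting="soft studio lighting",
    background="dark velvet background",
    camera="macro photography",
    details="sharp focus on gemstone facets",
))
print(caption)
```

Filling every field for every image makes gaps obvious: a caption with only a subject and a background is the "Bad" example above.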

Trigger Words

Use a rare token (e.g., "sks") as the trigger word for DreamBooth. Define one per LoRA style. The token must not collide with a word that already carries meaning in the model's vocabulary.
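
Before training, it can help to confirm that every caption actually contains the trigger. The following is a minimal sketch; `missing_trigger` is a hypothetical helper and the caption dictionary stands in for caption files read from disk. It only checks presence as a whole word; it does not test whether the token is rare in a given tokenizer.

```python
import re


def missing_trigger(captions: dict[str, str], trigger: str) -> list[str]:
    """Return the names of captions that lack the trigger as a whole word."""
    pattern = re.compile(rf"\b{re.escape(trigger)}\b")
    return sorted(name for name, text in captions.items() if not pattern.search(text))


# Illustrative data: file name -> caption text.
captions = {
    "ring_01.txt": "sks jewelry photo, 18k gold engagement ring, soft studio lighting",
    "ring_02.txt": "ring on dark background",
}
print(missing_trigger(captions, "sks"))
```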

Image Requirements

  • Resolution: 1024x1024 minimum (match training resolution)
  • Format: PNG preferred (lossless), JPG acceptable if high quality
  • Variety: mix angles, lighting, compositions within the domain
  • Quality: curation beats quantity. Remove blurry, poorly lit, or atypical examples.
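
The resolution requirement is easy to check mechanically. The sketch below reads width and height straight from the PNG header using only the standard library; `png_size` and `too_small` are hypothetical helper names, and JPG files would need an image library instead. Blur and lighting still need a human eye.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header: bytes) -> tuple[int, int]:
    """Read (width, height) from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # IHDR stores width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def too_small(folder: str, min_side: int = 1024) -> list[str]:
    """List PNG files in `folder` whose shorter side is below `min_side`."""
    flagged = []
    for path in sorted(Path(folder).glob("*.png")):
        with path.open("rb") as fh:
            width, height = png_size(fh.read(24))
        if min(width, height) < min_side:
            flagged.append(path.name)
    return flagged
```

Calling `too_small("data/dreambooth/jewelry")` would return the file names to upscale or drop; the folder path here is only an example.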

FLUX Klein 9B LoRA Training

Critical Rules (From 50+ Training Runs)

| Parameter | Value | Notes |
| --- | --- | --- |
| Network dims | 128/64/64/32 | linear/linear_alpha/conv/conv_alpha (4:2:2:1 ratio) |
| Weight decay | 0.00001 | 1/10th of the default, for balanced analog texture |
| Learning rate | DO NOT CHANGE | Even a 0.005% change destroys image quality on the Flux architecture |
| Optimal steps | ~7,000 | 3K = too raw, beyond 7K = anatomical distortion |
| Trigger word | One per style | Required for activation |

Klein-Specific Notes

  • Train on base model (klein-base-9b), NOT distilled
  • 9B uses qwen_3_8b text encoder; 4B uses qwen_3_4b
  • 4B LoRA NOT compatible with 9B and vice versa
  • FP8 transformer saves significant VRAM during 9B training

SANA LoRA Training

accelerate launch train_dreambooth_lora_sana.py \
  --pretrained_model_name_or_path=Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers \
  --instance_data_dir=data/dreambooth/jewelry \
  --output_dir=trained-sana-lora \
  --mixed_precision=bf16 \
  --instance_prompt="a photo of sks jewelry" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --lr_scheduler=constant \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks jewelry on black velvet" \
  --validation_epochs=25 \
  --seed=0 \
  --cache_latents \
  --offload

SANA Training Parameters

| Parameter | Value | Notes |
| --- | --- | --- |
| Learning rate | 1e-4 | Standard for SANA LoRA |
| Max steps | 500 | Fast convergence (vs 7K for FLUX) |
| Resolution | 1024 | Native SANA resolution |
| Effective batch | 4 | batch 1 x grad_accum 4 |
| Precision | bf16 | Required |
| Optimizer | 8-bit AdamW | Memory efficient |
| LR scheduler | constant | No warmup needed |
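
The effective batch and step count translate into passes over the dataset with a little arithmetic. The sketch below assumes that `--max_train_steps` counts optimizer updates (after gradient accumulation) and that training runs on a single GPU; both are assumptions to verify against the script version in use. The 10-image dataset is a hypothetical example.

```python
def effective_batch(train_batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to each optimizer update."""
    return train_batch_size * grad_accum_steps * num_gpus


def approx_epochs(max_train_steps: int, effective_batch_size: int, num_images: int) -> float:
    """Approximate number of passes over the dataset for a given step budget."""
    return max_train_steps * effective_batch_size / num_images


batch = effective_batch(train_batch_size=1, grad_accum_steps=4)
epochs = approx_epochs(max_train_steps=500, effective_batch_size=batch, num_images=10)
print(batch, epochs)
```

With these numbers each image is seen roughly 200 times, which is a useful sanity check when scaling the step budget to a larger or smaller dataset.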

LoRA Configuration

Target layers: attn.to_k, attn.to_q, attn.to_v for attention-only LoRA.

LoRA scaling = alpha / rank:

  • alpha = rank: scaling factor 1.0 (standard)
  • alpha < rank: subtler modifications
  • alpha > rank: amplified effect
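
To make the scaling rule concrete, the sketch below applies a LoRA update W' = W + (alpha / rank) * B A to a tiny weight matrix using plain Python lists. The function names and the toy matrices are illustrative assumptions; real trainers do this on tensors inside the adapter layers.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update."""
    return alpha / rank


def apply_lora(W, A, B, alpha: float, rank: int):
    """Return W + (alpha / rank) * (B @ A).

    Shapes: W is out x in, B is out x rank, A is rank x in.
    """
    scale = lora_scale(alpha, rank)
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]


W = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight (identity)
B = [[1.0], [2.0]]             # out x rank, rank = 1
A = [[0.5, 0.5]]               # rank x in

standard = apply_lora(W, A, B, alpha=1.0, rank=1)   # alpha = rank
subtle = apply_lora(W, A, B, alpha=0.5, rank=1)     # alpha < rank
print(standard, subtle)
```

The same rule explains the FLUX dims listed earlier: a linear rank of 128 with alpha 64, and a conv rank of 64 with alpha 32, both give a scaling factor of 0.5.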

Memory Optimizations

  • --offload: CPU offload text encoder + VAE when not in use
  • --cache_latents: precompute VAE latents, remove VAE from GPU
  • --use_8bit_adam: bitsandbytes 8-bit optimizer

Two-Stage Domain Training

Stage 1: Domain LoRA - teaches what the domain looks like:

  • 20-50 high-quality domain images with detailed captions
  • Standard T2I DreamBooth/LoRA training
  • Learns materials, lighting, textures, compositions

Stage 2: Task-specific - teaches what to DO:

  • Build on domain LoRA or merge it
  • Paired data for editing tasks (before/after)
  • InstructPix2Pix-style training for edit LoRAs

For txt2img domain generation, Stage 1 alone is sufficient.

Training Tool Comparison

| Feature | diffusion-pipe | ai-toolkit | SimpleTuner | kohya_ss |
| --- | --- | --- | --- | --- |
| Pipeline parallelism | DeepSpeed | No | No | No |
| Multi-GPU | Native hybrid | Limited | Data parallel | Data parallel |
| LoRA format (Klein) | ComfyUI native | Diffusers | Diffusers | Diffusers |
| Progressive resolution | No | No | Yes | No |
| Text encoder LoRA | No (pre-cache) | Yes | Yes | Yes |
| Masked training | Yes | No | No | No |
| LR scheduling | Warmup only | Full set | Full set | Full set |
| Prodigy optimizer | Not documented | Yes | Yes | Yes |
| Resume training | Pain-free | Yes | Yes | Yes |
| Windows | WSL2 only | Native | Native | Native |
| Config format | TOML | YAML | JSON + env | TOML |

diffusion-pipe Unique Features

Masked training: provide a binary mask per image. White regions contribute to the loss; black regions are excluded from it. Ideal for face-only training (train on the face, ignore background and clothing).
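
The idea behind masked training can be sketched as a loss that only counts pixels where the mask is white. This is a conceptual illustration in plain Python, not diffusion-pipe's implementation; in particular, normalizing by the number of unmasked pixels is an assumption, and `masked_mse` is a hypothetical name.

```python
def masked_mse(pred, target, mask) -> float:
    """Mean squared error over positions where mask == 1.

    Positions where mask == 0 (black) contribute nothing to the loss.
    """
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


pred = [[1.0, 5.0], [2.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [1, 0]]   # left column is "white" (trained), right column is ignored
print(masked_mse(pred, target, mask))
```

The large error at the masked position (5.0 vs 0.0) has no effect on the result, which is why a background the model reproduces poorly does not pull on the gradients.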

Pipeline parallelism: Klein 9B can be split across multiple GPUs via DeepSpeed. With pipeline_stages=2, the model is divided across 2 GPUs.

ComfyUI-native LoRA format: no conversion needed for Klein inference in ComfyUI.

diffusion-pipe Klein Config

[model]
type = 'flux2'
diffusion_model = '/path/to/flux-2-klein-base-9b.safetensors'
vae = '/path/to/flux2-vae.safetensors'
text_encoders = [
  {path = '/path/to/qwen_3_8b.safetensors', type = 'flux2'}
]
dtype = 'bfloat16'
diffusion_model_dtype = 'float8'
timestep_sample_method = 'logit_normal'
shift = 3

[adapter]
type = 'lora'
rank = 32
dtype = 'bfloat16'

[optimizer]
type = 'AdamW8bitKahan'
lr = 5e-5
betas = [0.9, 0.99]
weight_decay = 0.01

Dataset Config (diffusion-pipe)

resolutions = [1024]
enable_ar_bucket = true
min_ar = 0.5
max_ar = 2.0
num_ar_buckets = 7
num_repeats = 1
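
To see what these bucket settings imply, the sketch below computes aspect-ratio buckets between min_ar and max_ar and assigns an image to the nearest one. It assumes buckets are evenly spaced in log space and that aspect ratio means width / height; both are assumptions about diffusion-pipe's behavior to confirm against its documentation. `ar_buckets` and `nearest_bucket` are hypothetical helpers.

```python
import math


def ar_buckets(min_ar: float, max_ar: float, num_buckets: int) -> list[float]:
    """Aspect ratios evenly spaced in log space between min_ar and max_ar."""
    if num_buckets == 1:
        return [math.sqrt(min_ar * max_ar)]
    step = (math.log(max_ar) - math.log(min_ar)) / (num_buckets - 1)
    return [math.exp(math.log(min_ar) + i * step) for i in range(num_buckets)]


def nearest_bucket(width: int, height: int, buckets: list[float]) -> float:
    """Pick the bucket whose aspect ratio is closest in log space."""
    ar = width / height
    return min(buckets, key=lambda b: abs(math.log(b) - math.log(ar)))


buckets = ar_buckets(min_ar=0.5, max_ar=2.0, num_buckets=7)
print([round(b, 3) for b in buckets])
print(round(nearest_bucket(1536, 1024, buckets), 3))
```

With 7 buckets over the range 0.5 to 2.0 the middle bucket is square, so a mostly square dataset loses little to cropping, while portrait and landscape images each get three buckets.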

Overfitting Detection

| Symptom | Cause | Fix |
| --- | --- | --- |
| Training images reproduced exactly | Too many steps / too few images | Reduce steps, add data diversity |
| Anatomical distortion | Training past optimal point | Stop at ~7K steps (FLUX) |
| Color/style collapse | LR too high | Reduce LR (carefully for FLUX) |
| Prompt ignored | Overfit to training captions | More diverse captions, lower rank |
| Artifacts at high LoRA strength | Training instability | Lower alpha, add weight decay |

Dependencies

# SANA LoRA (diffusers)
# Quote the specifiers so the shell does not treat ">=" as a redirect or "[...]" as a glob
pip install "diffusers[training]" "peft>=0.14.0" accelerate bitsandbytes wandb
# From source for latest:
pip install git+https://github.com/huggingface/diffusers.git

# diffusion-pipe
pip install deepspeed  # heavy dependency

Gotchas

  • FLUX LR sensitivity: the FLUX architecture is extremely sensitive to learning rate changes. The documented "DO NOT CHANGE" warning comes from 50+ training runs showing even 0.005% deviation destroys output quality. This is architecture-specific - SANA and SDXL are far more forgiving.
  • Klein 4B/9B LoRA incompatibility: different text encoders (4B vs 8B Qwen) mean LoRAs trained on one model cannot be loaded on the other. Always verify model variant before training.
  • diffusion-pipe on Windows: requires WSL2. Direct Windows execution is not supported due to DeepSpeed dependency.
  • Cache latents for SANA: failing to use --cache_latents keeps the VAE on GPU throughout training, wasting 2-4 GB VRAM that could be used for larger batch size or higher rank.

See Also