LoRA Fine-Tuning for Editing Models¶

★★★★★ Advanced

Practical patterns for applying LoRA adapters to MMDiT-based editing models (Step1X Edit, Qwen-Image-Edit-2511). PixelSmile demonstrates the approach: an 850 MB LoRA adds an entirely new capability (expression control) to a 60 GB base model.

Standard Recipe¶

Target Modules¶

Full MMDiT coverage (maximum expressivity):

target_modules = [
    # Image-stream attention
    "to_q", "to_k", "to_v", "to_out.0",
    # Text-stream attention
    "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
    # Image FFN
    "img_mlp.net.0.proj", "img_mlp.net.2",
    # Text FFN
    "txt_mlp.net.0.proj", "txt_mlp.net.2"]

Hyperparameters (PixelSmile reference)¶

Parameter	Value	Notes
Rank	64	Higher than typical (32 works for simpler tasks)
Alpha	128	2x rank is standard
Dropout	0	Common for diffusion LoRA
LR	1e-4	Cosine schedule
Batch	4/GPU	On H200
Epochs	100	Expression task; simpler tasks need fewer
Hardware	4x H200	Or 4x A100-80GB
LoRA size	850 MB	For rank 64, all targets
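
The 850 MB figure can be sanity-checked from the LoRA parameter count: each adapted linear layer of shape d_in x d_out adds rank x (d_in + d_out) parameters. The sketch below assumes a hidden size of 3072, a 4x FFN expansion, 60 transformer blocks, and bf16 storage. None of these dimensions are stated above, so treat them as assumptions to verify against the actual checkpoint.

```python
def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def estimate_lora_bytes(rank, hidden, ffn_mult, num_blocks, bytes_per_param=2):
    """Rough LoRA file size for full MMDiT coverage (all attention + all FFN)."""
    ffn = hidden * ffn_mult
    # 8 attention projections per block (4 image-stream + 4 text-stream),
    # each assumed to map hidden -> hidden.
    attn = 8 * lora_params(rank, hidden, hidden)
    # 4 FFN projections per block: up and down projection for each stream.
    mlp = 2 * lora_params(rank, hidden, ffn) + 2 * lora_params(rank, ffn, hidden)
    return (attn + mlp) * num_blocks * bytes_per_param


# Assumed dimensions (hypothetical, not taken from the source).
size = estimate_lora_bytes(rank=64, hidden=3072, ffn_mult=4, num_blocks=60)
print(f"{size / 1e6:.0f} MB")  # 849 MB under these assumptions
```

Because the count is linear in rank, halving the rank to 32 halves the file size under the same assumptions.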

What to Freeze¶

VAE: always frozen (autoencoder doesn't need task-specific adaptation)
Text encoder: depends on task. PixelSmile trains it (needs new text→expression mapping). For style transfer, often frozen.
Transformer: LoRA is applied here; this is the core of the adaptation

Lighter Alternatives¶

Coverage	Targets	LoRA Size	Use When
Full (PixelSmile)	All attn + all FFN	~850 MB	New behavior (expressions, restoration)
Attention-only	to_q/k/v + add_q/k/v + out	~500 MB	Style transfer, composition
Image-stream only	to_q/k/v + img_mlp	~350 MB	Visual-only changes (no text reinterpretation)

Reducing to attention-only drops ~40% LoRA size. Image-stream-only cuts ~60% but loses text conditioning adaptation.
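
One way to keep the three coverage levels switchable in a training script is a small preset table. This is only a sketch: the preset names and the helper `target_modules_for` are hypothetical, and reading "out" in the attention-only row as both `to_out.0` and `to_add_out` is an assumption.

```python
ATTN_IMAGE = ["to_q", "to_k", "to_v", "to_out.0"]
ATTN_TEXT = ["add_q_proj", "add_k_proj", "add_v_proj", "to_add_out"]
FFN_IMAGE = ["img_mlp.net.0.proj", "img_mlp.net.2"]
FFN_TEXT = ["txt_mlp.net.0.proj", "txt_mlp.net.2"]

# Preset names are illustrative; module names come from the target list above.
COVERAGE_PRESETS = {
    "full": ATTN_IMAGE + ATTN_TEXT + FFN_IMAGE + FFN_TEXT,
    "attention_only": ATTN_IMAGE + ATTN_TEXT,
    # Table lists to_q/k/v + img_mlp, so the image output projection is left out.
    "image_stream_only": ["to_q", "to_k", "to_v"] + FFN_IMAGE,
}


def target_modules_for(coverage: str) -> list[str]:
    """Return a copy of the module-name list for the requested coverage level."""
    if coverage not in COVERAGE_PRESETS:
        raise ValueError(f"unknown coverage {coverage!r}; expected one of {sorted(COVERAGE_PRESETS)}")
    return list(COVERAGE_PRESETS[coverage])
```

The returned list can be passed as `target_modules` to whichever LoRA configuration object the training framework uses.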

Training Data Patterns¶

Synthetic Generation (PixelSmile approach)¶

Collect base identities from public datasets
Generate variations via strong API model (Nano Banana Pro)
Annotate with continuous scores via VLM (Gemini 3 Pro)
Train LoRA on synthetic pairs

Full Fine-Tune vs LoRA (MACRO comparison)¶

MACRO uses full fine-tune (not LoRA) because the task (multi-reference at 6-10 images) requires deep architectural adaptation. LoRA is sufficient when the base model already has the capability but needs behavioral steering.

Rule of thumb: if the base model can do the task poorly → LoRA. If it fundamentally can't → full fine-tune.

Loss Functions for Editing LoRA¶

Standard flow matching velocity loss + task-specific auxiliary losses:

Loss	Purpose	Lambda	Used By
Flow Matching (L_FM)	Core generation quality	1.0	All
Identity (ArcFace cosine)	Face preservation	0.1	PixelSmile
Symmetric Contrastive	Distinguish similar outputs	0.05	PixelSmile
LPIPS	Perceptual similarity	varies	RealRestorer
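
Read as a single objective, the PixelSmile rows of the table amount to a weighted sum. The form below is a sketch: the symbols for the identity and contrastive terms, the ArcFace embedding function f, and the rectified-flow parameterization of the velocity target are assumptions, not notation taken from the PixelSmile source.

```latex
\mathcal{L} = \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}}
            + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}
            + \lambda_{\mathrm{sc}}\,\mathcal{L}_{\mathrm{sc}},
\qquad \lambda_{\mathrm{FM}} = 1.0,\quad \lambda_{\mathrm{id}} = 0.1,\quad \lambda_{\mathrm{sc}} = 0.05

\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{x_0,\,\epsilon,\,t}
  \left\| v_\theta(x_t, t, c) - (\epsilon - x_0) \right\|^2,
\qquad x_t = (1 - t)\,x_0 + t\,\epsilon

\mathcal{L}_{\mathrm{id}} = 1 - \cos\!\big(f(\hat{x}_0),\, f(x_0)\big)
```

With these weights the auxiliary terms act as small corrections; the flow matching term still dominates the gradient unless the auxiliary losses are much larger in magnitude.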

Commercial Viability¶

Base model (Qwen-Image-Edit-2511) is Apache 2.0. Your LoRA weights are entirely yours. This creates a clean commercial path:

Apache 2.0 base (Qwen) + proprietary LoRA = your product

No license contamination from base model. LoRA weights are your IP.

FLUX.2 Klein 9B Edit LoRA Training¶

Klein 9B has native multi-reference conditioning (up to 4 images concatenated as latents) - not IP-Adapter, but architecture-level latent-space concatenation. This makes it suitable for face swap, head swap, and paired before/after edit LoRAs.

Architecture Specifics¶

Parameters: 9B flow model + 8B Qwen3 text embedder (~17B total)
Blocks: 8 DoubleStreamBlocks + 24 SingleStreamBlocks
Text encoder: Qwen3-8B bundled (NOT separate like FLUX.1 T5+CLIP)
Training base: always use FLUX.2-klein-base-9B (undistilled), never the distilled variant
FLUX.1-dev adapters incompatible: hidden_dim mismatch (3072 vs 4096)

Training Frameworks¶

Framework	Strength	Best For
SimpleTuner	`use_flux_kontext: true` for native paired training, `model_flavour: "klein-9b"`	Edit LoRAs with before/after pairs
AI-Toolkit (ostris)	Most tested (50+ run study), YAML config	Style LoRAs, character LoRAs
DiffSynth-Studio	Chinese ecosystem, ready-to-run scripts	Klein + Qwen-Image-Edit

Optimal Hyperparameters¶

Based on Calvin Herbst's 50+ A/B test study:

Parameter	Value	Notes
Linear rank/alpha	128/64	Winner across all models
Conv rank/alpha	64/32	4:2:2:1 ratio proven optimal
LR	1e-4	EXTREME sensitivity - 0.005% change degrades quality
Optimizer	adamw8bit	Best for faces. Avoid adafactor
Weight decay	0.00001	10x lower than default improves grain/texture
Scheduler	cosine_with_warmup (10%)	Direct evidence it helps face quality
Timestep type	sigmoid	Confirmed by 3 independent sources for face work
Noise offset	0.05-0.1	Slight contrast improvement
Precision	bf16 training + fp8 base	fp8 = better film grain, fp32 = better fidelity
Steps	3000-4000	For face edit with progressive dataset refinement
Save every	250 steps	Essential - overfitting is non-monotonic on FLUX

Dataset Format for Paired Training¶

SimpleTuner Kontext mode (recommended):

dataset/
  01_start.png      # Before image (source)
  01_start2.png     # Face reference crop (optional)
  01_end.png        # After image (target)
  01.txt            # "EDITFACE open the eyes wider"

Caption format: trigger word + specific change description. Describe the CHANGE, not the image. Be consistent across dataset.
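
Before training, it can help to check that every sample in the folder is complete. The sketch below assumes the naming scheme shown above (`NN_start.png`, optional `NN_start2.png`, `NN_end.png`, `NN.txt`); the function name `collect_pairs` and the `trigger` argument are hypothetical and are not part of SimpleTuner.

```python
from pathlib import Path


def collect_pairs(dataset_dir, trigger=None):
    """Group files into samples keyed by caption stem.

    Returns (samples, problems): complete samples as dicts, plus a list of
    human-readable messages for samples that are incomplete or inconsistent.
    """
    root = Path(dataset_dir)
    samples, problems = [], []
    for caption_path in sorted(root.glob("*.txt")):
        key = caption_path.stem
        start = root / f"{key}_start.png"
        end = root / f"{key}_end.png"
        ref = root / f"{key}_start2.png"  # optional face reference crop
        missing = [p.name for p in (start, end) if not p.exists()]
        if missing:
            problems.append(f"{key}: missing {', '.join(missing)}")
            continue
        caption = caption_path.read_text(encoding="utf-8").strip()
        if not caption:
            problems.append(f"{key}: empty caption")
            continue
        if trigger and not caption.startswith(trigger):
            problems.append(f"{key}: caption does not start with {trigger!r}")
            continue
        samples.append({
            "start": start,
            "end": end,
            "ref": ref if ref.exists() else None,
            "caption": caption,
        })
    return samples, problems
```

Calling `collect_pairs("dataset", trigger="EDITFACE")` and printing `problems` gives a quick list of what to fix before launching a run.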

Progressive Dataset Refinement (BFS Method)¶

The most successful published face swap LoRA (BFS) used:

Phase 1 (~2000 steps): 628 initial pairs, broad diversity
Phase 2 (~4000 steps): narrowed to 138 pairs, best skin-tone matches
Phase 3 (fine-tune): narrowed to 76 high-quality pairs

Start broad, filter to best pairs, continue training. Dataset refinement matters more than progressive resolution.

Key insight: dataset quality accounts for ~90% of output quality. Poor datasets produce poor LoRAs regardless of hyperparameters.

Character/Face LoRA Parameters (Community Consensus)¶

Based on 50+ A/B tests (Calvin Herbst), BFL official docs, and Chinese community (Zhihu):

Parameter	Character/Face	Style
Rank	32 (start), 128 for style	128/64/64/32
Alpha	rank or rank/2	64/32
LR	1e-4 (reduce to 5e-5 if unstable)	9.5e-5
Timestep type	sigmoid (face LoRA)	shift (BFL default)
Scheduler	cosine + warmup (10%)	cosine or constant
Steps	800-1500 (Flux 2 face)	3000-7000
Dataset	15-25 images (40+ loses focus)	20-50 images
Class prompts	DO NOT USE (hurts Flux)	N/A
Regularization images	Yes (6:1 ratio)	Optional

Why sigmoid for faces: targets low-noise timesteps that capture fine identity detail. "Shift" (BFL default) biases toward high-noise timesteps which learn coarse structure. Three independent sources confirm sigmoid for character LoRAs.

Flux 2 vs Flux 1 convergence: Flux 2 (including Klein) converges 40-50% faster than Flux 1. Optimal face steps: 800-1500 (Flux 2) vs 1000-2000 (Flux 1). Beyond 2000 steps, a Flux 2 face LoRA is in overfitting territory.

Overfitting is non-monotonic (a finding from the Chinese community): the checkpoint at epoch 6 may be good, epoch 8 overfit, and epoch 10 good again. Save every 250-500 steps and test each checkpoint.

LR Scheduling for Face LoRAs¶

Schedule	When to Use	Evidence
Constant	Short runs (<1500 steps), known good LR	Majority community default
Cosine + warmup	2000-5000 steps, insurance against late overfit	Direct evidence (15 LoRA comparison)
Prodigy (LR=1.0)	Uncertain LR, want auto-adaptation	Strongest for face training specifically
Polynomial (1e-3→1e-4)	"Aggressive start, conservative end"	ExponentialML experiment
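
For reference, a cosine schedule with 10% linear warmup can be written in a few lines. This is a hand-rolled sketch for inspecting the curve, not the implementation used by any of the frameworks above; the function name and the `min_lr` parameter are assumptions.

```python
import math


def cosine_with_warmup(step, total_steps, base_lr=1e-4, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the LR at the checkpoint interval recommended above (every 250 steps).
for step in range(0, 3001, 250):
    print(step, f"{cosine_with_warmup(step, total_steps=3000):.2e}")
```

Printing the schedule at the checkpoint interval makes it easy to see which saved checkpoints were trained at a low learning rate, which is where late overfitting is less likely.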

Prodigy config:

optimizer: prodigy
learning_rate: 1.0
lr_scheduler: cosine
optimizer_args: decouple=true, use_bias_correction=false, weight_decay=0.05
lr_warmup_steps: 200

Prodigy needs 20-30% more steps than AdamW to reach the same convergence.

ICEdit Diptych Training¶

Alternative approach: instead of before/after image pairs, place both images side-by-side in a diptych and train the model to understand the "edit" from left to right.

[before_image | after_image]  ← concatenated horizontally
Caption: "EDITWORD: [description of change]"

Higher data efficiency than separate pairs. The model learns the visual delta rather than memorizing absolute appearances.
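
The diptych layout itself is just a horizontal concatenation of two equally tall images. The sketch below uses nested lists of pixel values to stay dependency-free; in practice an image library would do the same thing. The function names are hypothetical and this is not the ICEdit preprocessing code.

```python
def make_diptych(before, after):
    """Concatenate two images horizontally as [before | after].

    Each image is a list of rows, and each row is a list of pixel values.
    Both images must have the same height.
    """
    if len(before) != len(after):
        raise ValueError("before and after images must have the same height")
    return [left_row + right_row for left_row, right_row in zip(before, after)]


def make_caption(trigger, change):
    """Build the caption in the 'EDITWORD: description of change' format."""
    return f"{trigger}: {change}"


# Two tiny 2x2 grayscale images become one 2x4 diptych.
before = [[0, 0], [0, 0]]
after = [[255, 255], [255, 255]]
diptych = make_diptych(before, after)
caption = make_caption("EDITWORD", "brighten the whole image")
```

Resizing both halves to the same height before concatenation is the one preprocessing step this layout requires.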

LoRAShop: Training-Free Multi-LoRA Mixing¶

Merge multiple LoRAs without additional training via weight interpolation in parameter space:

# Weighted average of LoRA delta weights
merged_lora = sum(w_i * lora_i for w_i, lora_i in zip(weights, loras))
# Normalize: weights sum to 1.0

Works best when LoRAs target similar content. Quality degrades when mixing LoRAs with very different training objectives (e.g., style + face).
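
A runnable version of the weighted average above, as a minimal sketch. It implements plain weighted averaging as described in this section and is not a reproduction of the LoRAShop method. It assumes each LoRA has already been expanded to full per-layer delta matrices (B multiplied by A), since averaging A and B separately does not give the average of their products. Matrices are nested lists to keep the example dependency-free, and the function name is hypothetical.

```python
def merge_lora_deltas(loras, weights):
    """Weighted average of per-layer delta matrices.

    loras: list of dicts mapping layer name -> 2D list (full delta for that layer).
    weights: one non-negative number per LoRA; normalized here to sum to 1.0.
    """
    if not loras or len(loras) != len(weights):
        raise ValueError("need one weight per LoRA and at least one LoRA")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    layer_names = set(loras[0])
    if any(set(lora) != layer_names for lora in loras):
        raise ValueError("all LoRAs must cover the same layers")

    merged = {}
    for name in layer_names:
        rows = len(loras[0][name])
        cols = len(loras[0][name][0])
        merged[name] = [
            [sum(w * lora[name][r][c] for w, lora in zip(norm, loras)) for c in range(cols)]
            for r in range(rows)
        ]
    return merged


style = {"blocks.0.to_q": [[1.0, 0.0]]}
face = {"blocks.0.to_q": [[0.0, 1.0]]}
merged = merge_lora_deltas([style, face], weights=[3, 1])
```

Here the 3:1 weighting is normalized to 0.75 and 0.25, so the merged delta stays on the same scale as the inputs.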

Layer Targeting for Face Edit¶

Approach	Target	Result
Double-stream only	`double_blocks` attention	Highest face similarity in A/B tests
All blocks	All double + single	Baseline
Single blocks only	`single_transformer_blocks 0-23`	DiffSynth default
Selective	Blocks 7, 12, 16, 20	Lighter, good LoRAs

Double-stream blocks handle cross-image interaction (reference latent + main image). For face conditioning, these are most critical.

Face Crop Preprocessing¶

Detection: InsightFace (buffalo_sc) for best diverse-face accuracy, or YOLOv8-face
Padding: 40-50% around bbox for face swap, 80-100% for head swap
Minimum: 512x512 face crop, 1024x1024 preferred
Alignment: prefer eyes-horizontal but Klein tolerates mild rotation
Include some neck/hair/ears for natural editing context
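
The padding rule can be sketched as a small helper. It assumes that "40-50% around bbox" means each side is extended by that fraction of the box width or height, which is one reading of the guideline; the function names are hypothetical, and the detector that produces the bounding box is not shown.

```python
def padded_crop_box(bbox, image_size, padding=0.45):
    """Expand a face bbox by `padding` on every side and clamp to the image.

    bbox: (x0, y0, x1, y1) in pixels; image_size: (width, height).
    Use padding of about 0.4-0.5 for face swap and 0.8-1.0 for head swap.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    pad_x = (x1 - x0) * padding
    pad_y = (y1 - y0) * padding
    return (
        max(0, round(x0 - pad_x)),
        max(0, round(y0 - pad_y)),
        min(width, round(x1 + pad_x)),
        min(height, round(y1 + pad_y)),
    )


def meets_minimum(box, min_side=512):
    """True if the crop is at least min_side pixels in both dimensions."""
    x0, y0, x1, y1 = box
    return (x1 - x0) >= min_side and (y1 - y0) >= min_side


box = padded_crop_box((400, 300, 800, 700), image_size=(2000, 1500), padding=0.5)
usable = meets_minimum(box)
```

Crops that fail the minimum-size check are better dropped than upscaled, in line with the 512x512 floor above.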

Evaluation Metrics¶

Metric	Target	Purpose
ArcFace CSIM	>0.85	Identity preservation
LPIPS	Lower = better	Visual closeness / edit containment
Background SSIM	~1.0	Non-face region preservation
CLIP Score	Task-dependent	Semantic alignment with prompt
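
The CSIM check reduces to a cosine similarity between two face embeddings. The sketch below assumes the embeddings have already been extracted by a face recognition model such as ArcFace (extraction is not shown); the function names are hypothetical and the default threshold is the 0.85 target from the table.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def passes_identity_check(source_emb, edited_emb, threshold=0.85):
    """True if the edited face is still close enough to the source identity."""
    return cosine_similarity(source_emb, edited_emb) > threshold


# Toy 2-D vectors stand in for real embeddings.
score = cosine_similarity([3.0, 4.0], [4.0, 3.0])
ok = passes_identity_check([3.0, 4.0], [4.0, 3.0])
```

Averaging this score over a held-out set of edits, per checkpoint, gives a simple way to compare the checkpoints saved during training.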

min_snr_gamma: Incompatible¶

min_snr_gamma does NOT work with the flow-matching scheduler used by Klein. Skip it entirely and use noise_offset for contrast control instead.

Gotchas¶

Qwen-Image-Edit requires DiffSynth-Studio framework, not standard diffusers. LoRA loading path differs.
PixelSmile requires a diffusers patch script (patch_qwen_diffusers.sh).
At rank 64+, LoRA training on 60GB base needs 4x 80GB GPUs. Lower rank (16-32) fits on 2x A100.
EMA (exponential moving average) on LoRA weights recommended for stability - PixelSmile uses it.
Data quality matters more than data quantity - synthetic data with VLM scoring outperforms larger messy datasets.
Klein 4B and 9B LoRAs are incompatible - different text encoder sizes (Qwen3-4B vs Qwen3-8B).
LR sensitivity is extreme on FLUX - changing by 0.005% can destroy output quality. Use adamw8bit or Prodigy.
Overfitting is non-monotonic - epoch 6 good, 8 bad, 10 good again. Save every checkpoint.
Hair is the #1 failure mode in face editing - hairline, length, shape must match between pairs.
Multi-reference cost scales ~3.5x for 2 refs, ~5x for 3 refs, ~7x for 4 refs.
Class prompts HURT Flux training, especially for males and pets.
