LoRA Fine-Tuning for Editing Models¶
Practical patterns for applying LoRA adapters to MMDiT-based editing models (Step1X Edit, Qwen-Image-Edit-2511). Demonstrated by PixelSmile — 850 MB LoRA adds entirely new capability (expression control) to a 60 GB base model.
Standard Recipe¶
Target Modules¶
Full MMDiT coverage (maximum expressivity):
target_modules = [
# Image-stream attention
"to_q", "to_k", "to_v", "to_out.0",
# Text-stream attention
"add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
# Image FFN
"img_mlp.net.0.proj", "img_mlp.net.2",
# Text FFN
"txt_mlp.net.0.proj", "txt_mlp.net.2",
]
Hyperparameters (PixelSmile reference)¶
| Parameter | Value | Notes |
|---|---|---|
| Rank | 64 | Higher than typical (32 works for simpler tasks) |
| Alpha | 128 | 2x rank is standard |
| Dropout | 0 | Common for diffusion LoRA |
| LR | 1e-4 | Cosine schedule |
| Batch | 4/GPU | On H200 |
| Epochs | 100 | Expression task; simpler tasks need fewer |
| Hardware | 4x H200 | Or 4x A100-80GB |
| LoRA size | 850 MB | For rank 64, all targets |
What to Freeze¶
- VAE: always frozen (autoencoder doesn't need task-specific adaptation)
- Text encoder: depends on task. PixelSmile trains it (needs new text→expression mapping). For style transfer, often frozen.
- Transformer: LoRA applied here is the core of the adaptation
Lighter Alternatives¶
| Coverage | Targets | LoRA Size | Use When |
|---|---|---|---|
| Full (PixelSmile) | All attn + all FFN | ~850 MB | New behavior (expressions, restoration) |
| Attention-only | to_q/k/v + add_q/k/v + out | ~500 MB | Style transfer, composition |
| Image-stream only | to_q/k/v + img_mlp | ~350 MB | Visual-only changes (no text reinterpretation) |
Reducing to attention-only drops ~40% LoRA size. Image-stream-only cuts ~60% but loses text conditioning adaptation.
Training Data Patterns¶
Synthetic Generation (PixelSmile approach)¶
- Collect base identities from public datasets
- Generate variations via strong API model (Nano Banana Pro)
- Annotate with continuous scores via VLM (Gemini 3 Pro)
- Train LoRA on synthetic pairs
Full Fine-Tune vs LoRA (MACRO comparison)¶
MACRO uses full fine-tune (not LoRA) because the task (multi-reference at 6-10 images) requires deep architectural adaptation. LoRA is sufficient when the base model already has the capability but needs behavioral steering.
Rule of thumb: if the base model can do the task poorly → LoRA. If it fundamentally can't → full fine-tune.
Loss Functions for Editing LoRA¶
Standard Flow Matching velocity loss + task-specific auxiliary losses:
| Loss | Purpose | Lambda | Used By |
|---|---|---|---|
| Flow Matching (L_FM) | Core generation quality | 1.0 | All |
| Identity (ArcFace cosine) | Face preservation | 0.1 | PixelSmile |
| Symmetric Contrastive | Distinguish similar outputs | 0.05 | PixelSmile |
| LPIPS | Perceptual similarity | varies | RealRestorer |
Commercial Viability¶
Base model (Qwen-Image-Edit-2511) is Apache 2.0. Your LoRA weights are entirely yours. This creates a clean commercial path:
No license contamination from base model. LoRA weights are your IP.
Gotchas¶
- Qwen-Image-Edit requires DiffSynth-Studio framework, not standard diffusers. LoRA loading path differs.
- PixelSmile requires a diffusers patch script (
patch_qwen_diffusers.sh). - At rank 64+, LoRA training on 60GB base needs 4x 80GB GPUs. Lower rank (16-32) fits on 2x A100.
- EMA (exponential moving average) on LoRA weights recommended for stability — PixelSmile uses it.
- Data quality matters more than data quantity — synthetic data with VLM scoring outperforms larger messy datasets.