LoRA Fine-Tuning for Editing Models

Practical patterns for applying LoRA adapters to MMDiT-based editing models (Step1X Edit, Qwen-Image-Edit-2511). PixelSmile demonstrates the approach: an 850 MB LoRA adds an entirely new capability (expression control) to a 60 GB base model.

Standard Recipe

Target Modules

Full MMDiT coverage (maximum expressivity):

target_modules = [
    # Image-stream attention
    "to_q", "to_k", "to_v", "to_out.0",
    # Text-stream attention
    "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
    # Image FFN
    "img_mlp.net.0.proj", "img_mlp.net.2",
    # Text FFN
    "txt_mlp.net.0.proj", "txt_mlp.net.2",
]
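As a rough sketch, the list above could be wired into a LoRA configuration as follows. This assumes Hugging Face PEFT is the LoRA library in use, which the source does not state. The rank, alpha, and dropout values are the PixelSmile reference values from this page, and `LORA_KWARGS`, `build_lora_config`, and `lora_scale` are illustrative names.

```python
# Sketch only: assumes Hugging Face PEFT; the names below are illustrative.
TARGET_MODULES = [
    # Image-stream attention
    "to_q", "to_k", "to_v", "to_out.0",
    # Text-stream attention
    "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
    # Image FFN
    "img_mlp.net.0.proj", "img_mlp.net.2",
    # Text FFN
    "txt_mlp.net.0.proj", "txt_mlp.net.2",
]

# PixelSmile reference hyperparameters (rank 64, alpha 128, no dropout).
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.0,
    "target_modules": TARGET_MODULES,
}


def build_lora_config():
    """Build a peft.LoraConfig from the kwargs above.

    peft is imported lazily so the constants stay usable without it installed.
    """
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


def lora_scale(kwargs=LORA_KWARGS):
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return kwargs["lora_alpha"] / kwargs["r"]
```

With alpha at twice the rank, `lora_scale()` evaluates to 2.0. The resulting config could then be passed to `peft.get_peft_model` together with the transformer, provided the framework in use exposes the model as a regular `torch.nn.Module`.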

Hyperparameters (PixelSmile reference)

Parameter	Value	Notes
Rank	64	Higher than typical (32 works for simpler tasks)
Alpha	128	2x rank is standard
Dropout	0	Common for diffusion LoRA
Learning rate	1e-4	Cosine schedule
Batch size	4 per GPU	On H200
Epochs	100	Expression task; simpler tasks need fewer
Hardware	4x H200	Or 4x A100-80GB
LoRA size	850 MB	For rank 64, all targets
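The LoRA size in the table can be sanity-checked with a back-of-the-envelope estimate: each adapted linear layer adds `rank * (in_features + out_features)` parameters. The sketch below assumes dimensions that are not given in this document (60 MMDiT blocks, hidden size 3072, FFN inner size 12288, adapters stored in bf16), so treat the result as an estimate under those assumptions.

```python
# Assumed dimensions (NOT stated in the source): 60 blocks, hidden size 3072,
# FFN inner size 12288, adapter weights stored in bf16 (2 bytes per parameter).
HIDDEN, FFN_INNER, NUM_BLOCKS, BYTES_PER_PARAM = 3072, 12288, 60, 2

# (in_features, out_features) of each targeted linear layer within one block.
LAYER_SHAPES = {
    "to_q": (HIDDEN, HIDDEN),
    "to_k": (HIDDEN, HIDDEN),
    "to_v": (HIDDEN, HIDDEN),
    "to_out.0": (HIDDEN, HIDDEN),
    "add_q_proj": (HIDDEN, HIDDEN),
    "add_k_proj": (HIDDEN, HIDDEN),
    "add_v_proj": (HIDDEN, HIDDEN),
    "to_add_out": (HIDDEN, HIDDEN),
    "img_mlp.net.0.proj": (HIDDEN, FFN_INNER),
    "img_mlp.net.2": (FFN_INNER, HIDDEN),
    "txt_mlp.net.0.proj": (HIDDEN, FFN_INNER),
    "txt_mlp.net.2": (FFN_INNER, HIDDEN),
}


def lora_param_count(rank, targets):
    """LoRA adds A (rank x in) and B (out x rank) for every targeted layer."""
    per_block = sum(rank * (LAYER_SHAPES[t][0] + LAYER_SHAPES[t][1]) for t in targets)
    return per_block * NUM_BLOCKS


def lora_size_mb(rank, targets):
    """Adapter size in decimal megabytes under the assumed storage dtype."""
    return lora_param_count(rank, targets) * BYTES_PER_PARAM / 1e6


all_targets = list(LAYER_SHAPES)
print(f"rank 64, all targets: {lora_size_mb(64, all_targets):.0f} MB")  # 849 MB
print(f"rank 32, all targets: {lora_size_mb(32, all_targets):.0f} MB")  # 425 MB
```

Under these assumptions the rank-64, all-targets estimate lands at about 849 MB, in line with the 850 MB in the table. Size scales linearly with rank, so halving the rank halves the adapter.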

What to Freeze

VAE: always frozen (the autoencoder doesn't need task-specific adaptation).
Text encoder: depends on the task. PixelSmile trains it because it needs a new text→expression mapping; for style transfer it is often left frozen.
Transformer: base weights stay frozen; the LoRA adapters applied here carry the core of the adaptation.

Lighter Alternatives

Coverage	Targets	LoRA Size	Use When
Full (PixelSmile)	All attn + all FFN	~850 MB	New behavior (expressions, restoration)
Attention-only	to_q/k/v + add_q/k/v + out	~500 MB	Style transfer, composition
Image-stream only	to_q/k/v + img_mlp	~350 MB	Visual-only changes (no text reinterpretation)

Reducing to attention-only cuts LoRA size by roughly 40%. Image-stream-only cuts it by roughly 60% but gives up adaptation of the text-conditioning path.
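The three coverage tiers can be written out as explicit target lists. The sketch below is illustrative: it reads "out" in the attention-only row as both output projections (`to_out.0` and `to_add_out`), which is an assumption, and it uses suffix matching on module names, mirroring how PEFT resolves a list of `target_modules`. The module names in `example_names` are hypothetical.

```python
COVERAGE_PRESETS = {
    "full": [
        "to_q", "to_k", "to_v", "to_out.0",
        "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
        "img_mlp.net.0.proj", "img_mlp.net.2",
        "txt_mlp.net.0.proj", "txt_mlp.net.2",
    ],
    # Assumption: "out" covers both the image and text output projections.
    "attention_only": [
        "to_q", "to_k", "to_v", "to_out.0",
        "add_q_proj", "add_k_proj", "add_v_proj", "to_add_out",
    ],
    "image_stream_only": [
        "to_q", "to_k", "to_v",
        "img_mlp.net.0.proj", "img_mlp.net.2",
    ],
}


def matches_target(module_name, targets):
    """True if the name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


def select_modules(module_names, preset):
    """Return the module names a given coverage preset would adapt."""
    targets = COVERAGE_PRESETS[preset]
    return [name for name in module_names if matches_target(name, targets)]


# Hypothetical module names for one block, for illustration only.
example_names = [
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.attn.add_q_proj",
    "transformer_blocks.0.img_mlp.net.0.proj",
    "transformer_blocks.0.txt_mlp.net.2",
    "transformer_blocks.0.img_mod.1",
]
image_only = select_modules(example_names, "image_stream_only")
```

Listing the selected modules before training is a cheap way to confirm that a preset touches the layers you expect and nothing else.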

Training Data Patterns

Synthetic Generation (PixelSmile approach)

1. Collect base identities from public datasets.
2. Generate variations with a strong API model (Nano Banana Pro).
3. Annotate each result with continuous scores from a VLM (Gemini 3 Pro).
4. Train the LoRA on the resulting synthetic pairs.
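The steps above can be sketched as a small data-building loop. Everything here is hypothetical scaffolding: `EditPair`, `build_pairs`, and the `generate` and `score` callables are illustrative names, and the actual calls to the image API and the VLM are deliberately left to the caller because the source does not describe them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EditPair:
    source_image: str   # path or ID of the base identity image
    edited_image: str   # path or ID of the generated variation
    instruction: str    # edit instruction used for generation
    score: float        # continuous annotation returned by the VLM


def build_pairs(
    identities: Iterable[str],
    instructions: Iterable[str],
    generate: Callable[[str, str], str],
    score: Callable[[str, str, str], float],
) -> List[EditPair]:
    """Generate one variation per (identity, instruction) and annotate it."""
    instructions = list(instructions)  # reused for every identity
    pairs = []
    for src in identities:
        for instr in instructions:
            edited = generate(src, instr)               # step 2: API model
            annotation = score(src, edited, instr)      # step 3: VLM scoring
            pairs.append(EditPair(src, edited, instr, annotation))
    return pairs
```

Passing the generator and scorer in as callables keeps the loop testable with stubs and lets either model be swapped without touching the pairing logic.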

Full Fine-Tune vs LoRA (MACRO comparison)

MACRO uses a full fine-tune rather than LoRA because its task (multi-reference editing with 6-10 reference images) requires deep architectural adaptation. LoRA is sufficient when the base model already has the capability and only needs behavioral steering.

Rule of thumb: if the base model can already do the task, even poorly, use LoRA. If it fundamentally cannot, use a full fine-tune.

Loss Functions for Editing LoRA

The standard flow-matching velocity loss is combined with task-specific auxiliary losses:

Loss	Purpose	Weight (λ)	Used by
Flow Matching (L_FM)	Core generation quality	1.0	All
Identity (ArcFace cosine)	Face preservation	0.1	PixelSmile
Symmetric Contrastive	Distinguish similar outputs	0.05	PixelSmile
LPIPS	Perceptual similarity	varies	RealRestorer
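The weights in the table combine as a plain weighted sum. The sketch below uses Python floats for readability; in a training loop the same arithmetic would run on tensors. The `1 - cosine similarity` form of the identity loss is an assumption about how an ArcFace cosine loss is typically written, not a confirmed detail of PixelSmile, and the function names are illustrative.

```python
import math

# Weights taken from the table above (PixelSmile column).
LOSS_WEIGHTS = {"flow_matching": 1.0, "identity": 0.1, "contrastive": 0.05}


def cosine_identity_loss(emb_a, emb_b):
    """Assumed form: 1 - cosine similarity between two face embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return 1.0 - dot / (norm_a * norm_b)


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms, keyed by name."""
    return sum(weights[name] * value for name, value in terms.items())


loss = total_loss({
    "flow_matching": 0.8,
    "identity": cosine_identity_loss([1.0, 0.0], [0.0, 1.0]),  # orthogonal
    "contrastive": 0.4,
})
```

Keeping the weights in one dictionary makes it easy to drop an auxiliary term for tasks that do not involve faces.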

Commercial Viability

The base model (Qwen-Image-Edit-2511) is licensed under Apache 2.0, and the LoRA weights you train are your own. This creates a clean commercial path:

Apache 2.0 base (Qwen) + proprietary LoRA = your product

Apache 2.0 is a permissive, non-copyleft license, so the base model's license does not carry over to the LoRA weights, which remain your IP.

Gotchas

Qwen-Image-Edit requires the DiffSynth-Studio framework, not standard diffusers, so the LoRA loading path differs.
PixelSmile requires a diffusers patch script (patch_qwen_diffusers.sh).
At rank 64 and above, LoRA training on the 60 GB base needs 4x 80 GB GPUs. Lower ranks (16-32) fit on 2x A100.
An exponential moving average (EMA) of the LoRA weights is recommended for stability; PixelSmile uses one.
Data quality matters more than data quantity: synthetic data with VLM scoring outperforms larger, messier datasets.
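The EMA recommendation above amounts to keeping a slowly updated shadow copy of the adapter weights and using that copy for evaluation and export. A minimal sketch follows, with plain floats standing in for tensors; the decay of 0.999 is an assumed, commonly used value, not one reported for PixelSmile, and `ema_update` is an illustrative name.

```python
def ema_update(ema_weights, current_weights, decay=0.999):
    """In-place EMA step: ema = decay * ema + (1 - decay) * current.

    Both arguments map parameter names to values. Only entries present in
    current_weights (for example the trainable LoRA matrices) are updated.
    """
    for name, value in current_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights


# Toy example with a single scalar "weight".
shadow = {"lora_A": 1.0}
ema_update(shadow, {"lora_A": 0.0}, decay=0.9)   # shadow["lora_A"] is now 0.9
ema_update(shadow, {"lora_A": 0.0}, decay=0.9)   # and now about 0.81
```

The update would typically run once after every optimizer step, with the shadow copy initialized from the adapter weights at the start of training.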