Step1X-Edit¶

Open-source image editing foundation model by StepFun (Shanghai). De facto standard open backbone for instruction-based image editing (2025-2026). Qwen-Image-Edit-2511 by Alibaba/Qwen is a production-tuned variant of this architecture.

Architecture¶

Input Image → VAE Encode → image latent (z)
Text Instruction → Qwen 2.5 VL → text embeddings (c)
                         ↓
              MMDiT Transformer Blocks
         (joint attention over z and c)
                         ↓
              VAE Decode → edited image

Three core components:

Component	Implementation	Trainable?	Size
Text Encoder	[[Qwen 2.5 VL]] (vision-language model)	Yes (usually)	~7B params
Transformer	MMDiT with joint attention	Yes / LoRA	~20B+ params
VAE	Custom autoencoder (RealRestorerAutoencoderKL variant)	Usually frozen	~200M params

Total weights: ~40-60 GB in bf16 depending on variant.

Why Qwen VL instead of CLIP¶

Previous editing models (InstructPix2Pix, MagicBrush) used CLIP as text encoder. CLIP encodes text only — it cannot "see" the input image at encoding time. Qwen 2.5 VL is a full vision-language model: - Processes both the text instruction AND the input image - Understands spatial relationships ("move the cup to the left") - Reasons about what to change vs what to preserve - Generates richer conditional embeddings

This is the key architectural innovation that separates Step1X-Edit generation models from prior approaches.

Scheduler¶

Uses Flow Matching instead of DDPM/DDIM. Default inference: 28 steps, guidance_scale 3.0. Faster convergence, more stable at low step counts.

Variants¶

Model	Maintainer	License	Notes
Step1X-Edit	StepFun	Apache 2.0	Original
Qwen-Image-Edit-2511	Alibaba/Qwen	Apache 2.0	Production variant, DiffSynth-Studio framework
RealRestorerTransformer2DModel	RealRestorer team	Academic only (weights)	Modified for restoration

Inference¶

from diffsynth import QwenImageEditPlusPipeline
pipe = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
result = pipe(prompt="change the background to a beach", image=input_image, num_inference_steps=28, guidance_scale=3.0)

VRAM: ~40 GB at 1024x1024 in bf16. With CPU offload — fits on 24 GB but slower. FP8 quantization → ~20-30 GB.

Downstream Models Built on This Architecture¶

RealRestorer — image restoration (9 degradation types)
PixelSmile — facial expression editing (LoRA rank 64, 850 MB)
MACRO Qwen variant — multi-reference generation (full fine-tune)

Commercial Use¶

Apache 2.0 — fully permitted for commercial use. Both Step1X-Edit and Qwen-Image-Edit-2511. This makes it the go-to backbone for commercial editing products, unlike models with NC restrictions.

Key Links¶

Step1X-Edit GitHub: github.com/stepfun-ai/Step1X-Edit
Qwen-Image-Edit HF: huggingface.co/Qwen/Qwen-Image-Edit-2511
Framework: DiffSynth-Studio (github.com/modelscope/DiffSynth-Studio)