MACRO (Multi-Reference Image Generation)

Dataset + benchmark + fine-tuning recipe that fixes quality degradation when generation models receive many (6-10) reference images. Not a new architecture — applied to existing models (Bagel, OmniGen2, Qwen-Image-Edit).

Paper: arXiv:2603.25319 (March 2026). Authors: HKU MMLab + Meituan.

Problem

Models like Bagel, OmniGen2, Qwen-Image-Edit support <image N> placeholders for multi-reference generation, but quality drops sharply at 6+ images. Root cause: training data bottleneck — existing datasets dominated by 1-2 reference pairs with no structured supervision for dense inter-reference dependencies.
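Because each reference is addressed through an <image N> placeholder, a prompt can silently point at an image that was never supplied. The check below is a minimal sketch, assuming the placeholders are 1-indexed and written literally as `<image N>` with a single space; the exact syntax may differ per model, and the helper names are hypothetical rather than part of any released code.

```python
import re

# Assumed placeholder syntax: "<image 1>", "<image 2>", ... (1-indexed).
PLACEHOLDER = re.compile(r"<image (\d+)>")


def referenced_indices(prompt: str) -> list[int]:
    """Return the sorted, de-duplicated image indices mentioned in the prompt."""
    return sorted({int(n) for n in PLACEHOLDER.findall(prompt)})


def check_prompt(prompt: str, num_refs: int) -> None:
    """Raise ValueError if the prompt mentions an image outside 1..num_refs."""
    out_of_range = [i for i in referenced_indices(prompt) if not 1 <= i <= num_refs]
    if out_of_range:
        raise ValueError(
            f"prompt references {out_of_range}, but only {num_refs} images were given"
        )


prompt = "Put the cat from <image 2> on the sofa in <image 1>."
check_prompt(prompt, num_refs=2)  # passes: both indices are in range
```

Running a check like this before inference is cheap compared with a wasted generation at 6-10 references.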

Solution: Data-Centric

MACRO-400K Dataset

400K samples, up to 10 references per sample, average 5.44 references. Four task categories (100K each):

Task | Description | Sources
Customization | Multi-subject composition | OpenSubject, MVImgNet, DL3DV, WikiArt
Illustration | Image from multimodal context | OmniCorpus-CC-210M web crawl
Spatial | Novel view synthesis | G-buffer Objaverse, Pano360, Polyhaven
Temporal | Future frame prediction | OmniCorpus-YT videos

Balanced across reference count brackets: 1-3 / 4-5 / 6-7 / 8-10.
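To verify the bracket balance on a local copy of the data, each sample's reference count can be mapped to its bracket and tallied. This is a sketch only: the bracket boundaries come from the line above, while the function names and the idea of passing a plain list of counts are assumptions, not the dataset's actual schema.

```python
from collections import Counter

# Reference-count brackets as listed in this note (inclusive bounds).
BRACKETS = [(1, 3), (4, 5), (6, 7), (8, 10)]


def bracket_of(num_refs: int) -> str:
    """Map a reference count to its bracket label, e.g. 5 -> '4-5'."""
    for low, high in BRACKETS:
        if low <= num_refs <= high:
            return f"{low}-{high}"
    raise ValueError(f"reference count {num_refs} is outside 1-10")


def bracket_histogram(ref_counts: list[int]) -> Counter:
    """Count how many samples fall into each bracket."""
    return Counter(bracket_of(n) for n in ref_counts)


# Hypothetical reference counts for five samples.
print(bracket_histogram([1, 3, 4, 8, 10]))
```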

Construction pipeline: Split → Generate (Gemini + Nano APIs) → Filter (LLM scoring + bidirectional VLM assessment). The generation step uses proprietary APIs — pipeline not fully reproducible, but the resulting dataset is fully released.

Dynamic Resolution Scaling

At inference, input images are automatically downsized as the reference count increases:

  • 1-2 images: 1M px
  • 3-5 images: 590K px
  • 6+ images: 262K px
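The tiers can be sketched as a pixel budget plus an aspect-preserving resize. Several things here are assumptions: that the budgets correspond to 1024x1024, 768x768, and 512x512 (the areas match the listed figures), that the budget applies per image, and that images are never upscaled. The function names are hypothetical and the released code may round or snap dimensions differently.

```python
import math


def pixel_budget(num_images: int) -> int:
    """Pixel budget per input image for a given reference count (assumed tiers)."""
    if num_images < 1:
        raise ValueError("need at least one input image")
    if num_images <= 2:
        return 1024 * 1024  # 1,048,576, i.e. about 1M px
    if num_images <= 5:
        return 768 * 768  # 589,824, i.e. about 590K px
    return 512 * 512  # 262,144, i.e. about 262K px


def fit_to_budget(width: int, height: int, budget: int) -> tuple[int, int]:
    """Shrink (width, height) so width * height <= budget, keeping aspect ratio."""
    area = width * height
    if area <= budget:
        return width, height  # already small enough; do not upscale
    scale = math.sqrt(budget / area)
    # Flooring both sides guarantees the result stays within the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# A 2048x1024 reference used alongside seven others (8 images in total).
print(fit_to_budget(2048, 1024, pixel_budget(8)))
```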

Training Recipe

Full fine-tune (not LoRA). Per-model framework:

Model | Framework | Training | Size
Bagel (14.7B, MoT) | FSDP + FLEX packing | LR 2e-5, 10 epochs, VAE frozen | ~29.5 GB
OmniGen2 | Native framework | Same hyperparams | (not listed)
Qwen-Image-Edit | DiffSynth + DeepSpeed | Same hyperparams | ~98.6 GB

T2I co-training: 10% text-to-image data mixed in to preserve general T2I capability.
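One way to realize a 10% mix is to draw each training sample from the T2I pool with probability 0.1 and from the multi-reference pool otherwise. The sketch below assumes the 10% refers to the share of samples in the final stream; the note does not say how the released training code implements the mix, and `mixed_stream` is a hypothetical name.

```python
import random
from itertools import islice


def mixed_stream(multi_ref: list, t2i: list, t2i_fraction: float = 0.10, seed: int = 0):
    """Yield samples forever, taking roughly `t2i_fraction` of them from `t2i`."""
    if not multi_ref or not t2i:
        raise ValueError("both sample pools must be non-empty")
    if not 0.0 <= t2i_fraction <= 1.0:
        raise ValueError("t2i_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded for reproducible mixing
    while True:
        pool = t2i if rng.random() < t2i_fraction else multi_ref
        yield rng.choice(pool)


# Toy pools standing in for real datasets.
batch = list(islice(mixed_stream(["multi_ref_sample"], ["t2i_sample"]), 20))
print(batch.count("t2i_sample"), "of", len(batch), "samples are T2I")
```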

Results (MacroBench)

4000 samples, 16 sub-categories, LLM-scored:

Model | Open? | Score | vs Base
Nano Banana Pro | No | 6.12 | -
GPT-Image-1.5 | No | 5.89 | -
Macro-Bagel | Yes | 5.71 | +88% (base: 3.03)
Macro-OmniGen2 | Yes | - | significant improvement
Macro-Qwen | Yes | - | mitigates severe drops at 6-10 references

Macro-Bagel approaches Nano Banana Pro in Customization, surpasses it in Spatial tasks.
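The +88% figure for Macro-Bagel follows directly from the two scores quoted above, as this short check shows. Only the numbers 3.03 and 5.71 come from the note; the helper name is illustrative.

```python
def relative_gain(base: float, tuned: float) -> float:
    """Relative improvement of `tuned` over `base`, as a fraction."""
    return (tuned - base) / base


gain = relative_gain(base=3.03, tuned=5.71)
print(f"{gain:+.1%}")  # prints "+88.4%"
```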

Ablation Insights

  • Sharpest gains between 1K-10K samples, diminishing returns 10K-20K
  • Upweighting large-input samples (2:2:3:3 ratio) helps without hurting low-input
  • Cross-task co-training provides synergistic benefits — spatial training helps customization
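The 2:2:3:3 upweighting can be read as sampling weights over the four reference-count brackets. The sketch below assumes the ratio maps in order to 1-3 / 4-5 / 6-7 / 8-10 and that weighting is applied by sampling a bracket first; the note does not confirm either detail, and the names are hypothetical.

```python
import random

# Assumed mapping of the 2:2:3:3 ratio onto the reference-count brackets.
BRACKET_WEIGHTS = {"1-3": 2, "4-5": 2, "6-7": 3, "8-10": 3}


def bracket_probabilities(weights: dict[str, int]) -> dict[str, float]:
    """Normalize integer ratio weights into sampling probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_bracket(rng: random.Random, weights: dict[str, int] = BRACKET_WEIGHTS) -> str:
    """Pick one bracket according to the ratio weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


print(bracket_probabilities(BRACKET_WEIGHTS))
print(sample_bracket(random.Random(0)))
```

Under this reading, samples with 6 or more references make up 60% of training draws instead of the 50% a uniform split over brackets would give.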

Inference

# Bagel variant; `model` and the reference images are assumed to be loaded already
from inference_bagel import generate
result = generate(model, prompt="...", reference_images=[img1, img2, ...], resolution=768)
# Default resolution: 768x768

VRAM: 40-80 GB depending on model variant. enable_model_cpu_offload supported for OmniGen2.

License

Component | License
Code | Apache 2.0 (stated on HF; GitHub repo has no LICENSE file)
All 3 model weights | Apache 2.0
MACRO-400K dataset | CC-BY-4.0

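Code, weights, and dataset are all released under licenses that permit commercial use. CC-BY-4.0 requires attribution, and the Apache 2.0 license for the code is so far declared only on Hugging Face, not in the GitHub repo.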

Gotchas

  • Dataset construction uses proprietary Gemini/Nano APIs — cannot recreate dataset, but can use the released one
  • GitHub repo has no explicit LICENSE file yet (fresh project, 3 days old)
  • Full fine-tune requires multi-GPU setup (FSDP) — not a quick LoRA
  • Training code released but expects specific framework versions

Links

  • GitHub: github.com/HKU-MMLab/Macro
  • HF models: huggingface.co/Azily/ (Macro-Bagel, Macro-OmniGen2, Macro-Qwen-Image-Edit)
  • HF dataset: huggingface.co/datasets/Azily/Macro-Dataset
  • Project page: macro400k.github.io