
Calligrapher (Freestyle Text Image Customization)

Generates and edits text in images, guided by a style reference. Built on [[FLUX Kontext|FLUX.1-Fill-dev]] plus a SigLIP style encoder: it takes a font/style sample from a reference image and renders new text in the same style.

Paper: arXiv:2506.24123 (June 2025).

Architecture

Style reference image → SigLIP (ViT, siglip-so400m-patch14-384)
                           → Qformer (learnable queries)
                           → Linear projection → K_E, V_E
Input image + mask → FLUX.1-Fill-dev denoising ← cross-attention:
                                         Q from denoiser, K/V from style encoder
                                         Output with styled text

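Style injection: the style encoder output replaces the K and V matrices in the style attention module. This is replacement, not concatenation: Q still comes from the denoiser, while K and V come only from the style features.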
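The K/V replacement described above can be sketched as a single-head cross-attention in plain Python. This is a minimal illustration, not the authors' implementation: the function name `style_attention`, the projection matrices `w_k` and `w_v`, and all shapes and values are hypothetical, and details such as multi-head splitting, biases, and how the result is merged back into the denoiser are omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def style_attention(q, style_tokens, w_k, w_v):
    """Q comes from the denoiser; K_E and V_E are projected only from style tokens.

    No denoiser-side keys or values are concatenated (assumed reading of
    "replacement, not concatenation").
    """
    k_e = matmul(style_tokens, w_k)   # (num_style_tokens, d)
    v_e = matmul(style_tokens, w_v)   # (num_style_tokens, d)
    d = len(q[0])
    scores = matmul(q, transpose(k_e))  # (num_queries, num_style_tokens)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v_e), weights

# Toy inputs (hypothetical): 3 denoiser queries of width 2, 4 style tokens of width 3.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
style_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_v = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]

out, weights = style_attention(q, style_tokens, w_k, w_v)
```

Each row of `weights` is a distribution over the style tokens only, so every output row is a mixture of style values; the denoiser contributes nothing but the queries.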

Self-Distillation Training

No manual annotation needed:

  1. LLM generates prompts with typographic style descriptors
  2. Pretrained FLUX synthesizes stylized text images
  3. Neural text detection locates text regions
  4. Strategic cropping → style reference + transfer target
  5. Model trains on self-generated pairs
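The data flow of steps 1 to 4 above can be sketched as follows. Everything here is an assumption made for illustration: the `TrainingPair` fields, the helper names, and in particular the cropping rule (reference = detected text box, target = full image plus a box mask) are hypothetical stand-ins, since the note does not specify how the "strategic cropping" is done. The LLM, the FLUX synthesizer, and the text detector are passed in as callables and replaced with toy stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image as rows of pixels
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive

@dataclass
class TrainingPair:
    prompt: str
    style_reference: Image  # crop around the detected text (assumed rule)
    target: Image           # full synthesized image
    mask: Image             # 1 inside the text box, 0 elsewhere

def crop(image: Image, box: Box) -> Image:
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def box_mask(height: int, width: int, box: Box) -> Image:
    left, top, right, bottom = box
    return [[1 if top <= y < bottom and left <= x < right else 0
             for x in range(width)] for y in range(height)]

def build_pair(make_prompt: Callable[[], str],
               synthesize: Callable[[str], Image],
               detect_text: Callable[[Image], Box]) -> TrainingPair:
    prompt = make_prompt()            # step 1: LLM writes a styled-text prompt
    image = synthesize(prompt)        # step 2: pretrained model renders it
    box = detect_text(image)          # step 3: detector locates the text
    reference = crop(image, box)      # step 4: crop reference, mask target
    mask = box_mask(len(image), len(image[0]), box)
    return TrainingPair(prompt, reference, image, mask)

# Toy stubs standing in for the real models (hypothetical values).
pair = build_pair(
    make_prompt=lambda: "the word 'OPEN' in neon script",
    synthesize=lambda prompt: [[x + 10 * y for x in range(6)] for y in range(4)],
    detect_text=lambda image: (1, 1, 4, 3),
)
```

Step 5 would then train the model to reconstruct the masked region of `target` given `style_reference`, which is why no human labels are needed: both halves of each pair come from the same synthesized image.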

Modes

  • Self-reference: change text content, preserve original style
  • Cross-reference: apply style from different image
  • Non-text reference: transfer from arbitrary images (fire, water, etc.)
  • Multilingual: Chinese, Korean, Japanese via TextFLUX

Results

FID (lower is better): 38.09 vs 66-70 for baselines. OCR accuracy: 0.84 vs 0.45-0.81 for baselines. 72% user preference.

VRAM / Speed

~4s per image on an A6000 at 10 steps. Recommended resolution: 512px (the training resolution). 768px is acceptable; higher resolutions tend to produce spelling errors.

License

Inherits FLUX.1-Fill-dev Non-Commercial License. Outputs can be used commercially.

  • GitHub: github.com/Calligrapher2025/Calligrapher
  • HF: huggingface.co/Calligrapher2025/Calligrapher