Skin Retouch Pipeline

Intermediate

Automated blemish detection and removal pipeline for photos. Two-stage architecture: detect defects -> inpaint with texture preservation. Key constraint: high-frequency skin texture (pores, microrelief) must survive the process.

Pipeline Architecture

Input photo
  -> [Detection] INSID3 or YOLOE + SAM 2
  -> pixel-accurate masks with +5-10px dilation
  -> [Routing] by defect size:
     small (<1% area) -> LaMa or frequency-split
     medium/edge      -> FLUX.1 Fill (8 steps)
     problem cases    -> FLUX.1 Fill (20 steps)

Estimated speed on H200 (2048x2048 photo with 5-10 blemishes): ~2.5 seconds total.
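The routing stage above can be sketched as a small dispatch function. This is a hypothetical sketch, not pipeline code; the backend labels and the `is_problem_case` flag are illustrative names:

```python
def route_defect(mask_area_px: int, image_area_px: int,
                 is_problem_case: bool = False) -> str:
    """Pick an inpainting backend for one defect mask, mirroring the diagram."""
    if is_problem_case:
        return "flux_fill_20_steps"            # problem cases: full quality
    if mask_area_px / image_area_px < 0.01:    # small: <1% of the image
        return "classical"                     # LaMa or frequency-split
    return "flux_fill_8_steps"                 # medium / edge cases

# A 40x40 px blemish on a 2048x2048 photo lands in the small branch
print(route_defect(40 * 40, 2048 * 2048))
```

In practice the threshold would be tuned per photo resolution; 1% is the figure from the diagram above.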

Stage 1: Detection

INSID3 (Primary)

One-shot in-context segmentation on a frozen DINOv3 backbone: annotate 1-3 reference examples of a defect type, and INSID3 segments the same category on new images. No training, no decoder, no auxiliary models.

Why ideal for skin: one-shot by defect type (annotate 1-3 blemish examples with masks), outputs segmentation masks directly (no SAM refinement needed), Apache 2.0, self-hosted.

from models import build_insid3
model = build_insid3()  # DINOv3 ViT-L default
model.set_reference("ref_acne.png", "ref_mask.png")
model.set_target("new_photo.png")
pred_mask = model.segment()

Speed: 3.31 FPS (3.4x faster than GF-SAM, 30x faster than Matcher). Backbones: ViT-S/B for 2 GB VRAM, ViT-L default.

YOLOE + SAM 2 (Alternative)

YOLOE detects bounding boxes via visual prompt (SAVPE encoder). SAM 2 refines boxes to pixel-accurate masks. Two models but well-tested pipeline.

  • YOLOE: 4-8 GB VRAM, real-time, 1.4x faster than YOLO-Worldv2. Ultralytics support.
  • SAM 2: ~8-10 GB VRAM. Already in ComfyUI ecosystem.

Specialized Acne Detectors

Pre-trained YOLO models exist for face acne (Roboflow yolov8-acne, skin-disease-recognition). ACNE04 dataset available for fine-tuning. These work on selfie-level faces; body shots need fine-tuning on custom data.

Anomaly Detection Approach (MuSc)

Zero-shot, no training, no prompts. Uses the principle that normal patches have many similar neighbors, anomalies don't. Blemishes = skin anomalies. Proven on industrial inspection (MVTec AD), untested on skin domain but conceptually fitting.
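The neighbor-count principle behind MuSc can be illustrated with plain numpy: score each patch feature by its mean distance to its k nearest neighbors, so patches with many close lookalikes (normal skin) score low and isolated patches (blemishes) score high. This is a conceptual toy, not the MuSc implementation:

```python
import numpy as np

def anomaly_scores(patches: np.ndarray, k: int = 5) -> np.ndarray:
    """patches: (N, D) patch feature vectors. Higher score = more anomalous."""
    # Pairwise squared distances between all patches
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)   # ignore self-distance
    d2.sort(axis=1)                # ascending per row
    return d2[:, :k].mean(axis=1)  # mean distance to the k nearest neighbors

rng = np.random.default_rng(0)
skin = rng.normal(0.0, 0.05, size=(99, 8))  # dense cluster = normal skin
blemish = np.full((1, 8), 2.0)              # one outlier patch
scores = anomaly_scores(np.vstack([skin, blemish]))
print(scores.argmax())  # index 99 -> the blemish patch scores highest
```

MuSc itself works on pretrained ViT patch features and mutual scoring across unlabeled images, but the "anomalies have no close neighbors" intuition is the same.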

Stage 2: Inpainting

For Small Masks (<1% area) - Classical Methods Win

For 2-20px blemishes on uniform skin, neural networks are overkill and often worse than classical methods. The task is copying neighboring skin, not generating structure.

# Pseudocode; assumes a float image in [0, 1]. For each mask (2-20px blemish):
crop = image[y-32:y+32, x-32:x+32]          # 64x64 window around the defect
low_freq = gaussian_blur(crop, sigma=8)     # skin tone and lighting
high_freq = crop - low_freq + 0.5           # pores/microrelief, centered at 0.5

# Replace LF inside the mask with the median of a surrounding clean ring
donor_low = median(low_freq, ring_8_to_30px)
# Copy HF from a nearby clean skin patch (PatchMatch-style search)
donor_hf = patchmatch_nearest(high_freq, mask, radius=50)

result = donor_low + (donor_hf - 0.5)       # recombine the two bands
# Paste back with a feathered edge

20-100ms per mask, pure CPU + numpy/OpenCV. Texture preserved because HF comes from real skin.
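The final "paste back with a feathered edge" step can be sketched in pure numpy. The helper names are hypothetical, and the 3x3 repeated averaging is a toy stand-in for the Gaussian mask blur a real pipeline would use:

```python
import numpy as np

def feather(mask: np.ndarray, iterations: int = 4) -> np.ndarray:
    """Soften a binary mask into a [0, 1] alpha by repeated 3x3 averaging."""
    alpha = mask.astype(np.float64)
    H, W = alpha.shape
    for _ in range(iterations):
        p = np.pad(alpha, 1, mode="edge")
        # Mean of the nine shifted views = one 3x3 box-blur pass
        alpha = sum(p[i:i+H, j:j+W] for i in range(3) for j in range(3)) / 9.0
    return alpha

def paste_feathered(dst, patch, mask, iterations=4):
    """Alpha-blend `patch` into `dst` with a feathered mask edge."""
    alpha = feather(mask, iterations)
    if dst.ndim == 3:               # broadcast the alpha over color channels
        alpha = alpha[..., None]
    return dst * (1.0 - alpha) + patch * alpha

mask = np.zeros((16, 16)); mask[5:11, 5:11] = 1.0
blended = paste_feathered(np.zeros((16, 16)), np.ones((16, 16)), mask,
                          iterations=2)
# Mask interior fully replaced, far corners untouched, edges blend smoothly
```

Feathering the paste hides the seam the same way the +3-5px mask dilation hides the defect edge: the transition lands on clean skin on both sides.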

PatchMatch + Poisson Blending (0 GB VRAM)

# Dilate the mask +3-5px to shift the seam away from the defect edge
mask_dilated = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=1)
result = pypatchmatch.inpaint(image, mask_dilated, patch_size=7)
# Poisson blending fixes lighting mismatch across the seam
result = cv2.seamlessClone(result, image, mask, center, cv2.MIXED_CLONE)

50-200ms on 512x512 crop. PatchMatch copies real pixel patches, Poisson handles lighting mismatch.

Why Classical Beats Neural Here

At 2-20px mask size, there's nothing to "hallucinate" - just copy neighboring skin. PatchMatch preserves actual pores. Neural inpainters (even LaMa) tend to average HF content, producing flat "plastic" skin in the mask region.

For Medium/Complex Masks - Neural Inpainting

LaMa / Big-LaMa (1-2 GB VRAM)

  • ~2 seconds on HD image, 7-10% mask area
  • Apache 2.0, battle-tested for skin retouch and tattoo removal
  • Good on small uniform areas, poor on complex texture boundaries
  • Fourier convolutions can flatten skin texture in larger masks
  • Use with HF reinjection post-fix (see below)

FLUX.1 Fill + Turbo-Alpha (24+ GB VRAM)

  • 8 steps with Turbo-Alpha LoRA: ~3-5 seconds on H200
  • Best open-source semantic inpainting for skin
  • Alimama ControlNet-Inpainting-Beta for direct control
  • GGUF variants from 12-16 GB

RETHINED (WACV 2025 Oral, 0.5 GB VRAM)

Lightweight CNN for structure + patch replacement for details. Same hybrid principle as frequency-split but end-to-end trained. <=30ms on mobile. Works on UHD. Best "one model for everything" option under 2 GB VRAM.

HF Reinjection Post-Fix

For any neural inpainter that flattens texture:

# Pseudocode; gaussian = Gaussian blur. After the neural inpaint:
hf_original = original_crop - gaussian(original_crop, sigma=8)
# Find donor HF in the clean 30px ring around the mask
hf_donor = patchmatch(hf_original, mask_ring_30px)
# Swap the HF band inside the mask for the donor's real-skin HF
result_hf = inpainted_crop - gaussian(inpainted_crop, sigma=8)
result_hf[mask] = hf_donor[mask]
final = gaussian(inpainted_crop, sigma=8) + result_hf

Neural-clean low-frequency + real-skin high-frequency. This is the frequency separation principle from professional retouching applied as automated post-processing.
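The key property that makes this safe is that the frequency split is exactly invertible: low + high reconstructs the original pixel-for-pixel, so swapping only the HF band cannot disturb anything outside the mask. A minimal numpy demonstration, using a mean filter as a stand-in for the sigma=8 Gaussian:

```python
import numpy as np

def low_pass(img: np.ndarray, r: int = 2) -> np.ndarray:
    """Toy low-pass: (2r+1)x(2r+1) mean filter (stand-in for a Gaussian)."""
    H, W = img.shape
    k = 2 * r + 1
    p = np.pad(img, r, mode="edge")
    return sum(p[i:i+H, j:j+W] for i in range(k) for j in range(k)) / k**2

rng = np.random.default_rng(1)
skin = rng.random((32, 32))
low = low_pass(skin)
high = skin - low                     # pores/microrelief live in this band
assert np.allclose(low + high, skin)  # the split is exactly invertible

# Swap the HF band inside a mask while keeping the LF band untouched
mask = np.zeros((32, 32), bool)
mask[10:20, 10:20] = True
donor_high = high.copy()
donor_high[mask] = high[~mask][:mask.sum()]  # toy donor: HF from clean area
fixed = low + donor_high              # identical to `skin` outside the mask
```

In the real post-fix the LF band comes from the neural inpaint and the HF donor from a PatchMatch search, but the band arithmetic is exactly this.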

LoRA Alternative: Single-Pass Edit

Train an edit LoRA on FLUX.2 Klein 9B with before/after blemish pairs (50-200 pairs). The model learns to see and remove defects without explicit detection/masking.

Pros: single step, no pipeline complexity. Cons: no control over what gets modified - risk of overcorrecting (smoothing pores, removing moles). Works best as semi-automated with human selection of final result.

See LoRA Fine Tuning for Editing Models for training details.

VRAM Budget Guide

| VRAM Budget | Detection           | Inpainting                               |
|-------------|---------------------|------------------------------------------|
| 0 GB (CPU)  | INSID3 ViT-S on CPU | Frequency-split or PatchMatch            |
| 2 GB        | INSID3 ViT-S/B      | RETHINED or LaMa fp16 on 256x256 crops   |
| 8 GB        | YOLOE + SAM 2       | LaMa full resolution                     |
| 24+ GB      | Any                 | FLUX.1 Fill + Turbo-Alpha (8 steps)      |
| 40+ GB      | Any                 | Qwen-Image-Edit (20B) for complex cases  |
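The budget table can be encoded as a small tier lookup for automated backend selection. A sketch with hypothetical backend labels:

```python
# VRAM tiers from the table above: (threshold_gb, (detection, inpainting))
TIERS = [
    (40, ("any", "qwen_image_edit_20b")),
    (24, ("any", "flux1_fill_turbo_8_steps")),
    (8,  ("yoloe_sam2", "lama_full_res")),
    (2,  ("insid3_vit_s_b", "rethined_or_lama_fp16_crops")),
    (0,  ("insid3_vit_s_cpu", "frequency_split_or_patchmatch")),
]

def pick_backends(vram_gb: float) -> tuple[str, str]:
    """Return (detection, inpainting) for the highest tier that fits."""
    for threshold, backends in TIERS:
        if vram_gb >= threshold:
            return backends

print(pick_backends(24))  # ('any', 'flux1_fill_turbo_8_steps')
```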

Gotchas

  • DINOv3 patch size (14-16px) may be too coarse for very small blemishes (2-5px). Feed high-res crops rather than full images to INSID3 for small defect detection.
  • LaMa Fourier convolutions flatten skin texture in larger masks - always combine with HF reinjection if texture preservation matters.
  • T-Rex2 and DINO-X are cloud-only - no self-hosting option. Privacy concern for client photos. Use YOLOE or INSID3 instead.
  • No public end-to-end "blemish remover" model exists with open weights. All production solutions (Lightroom, PortraitPro) are closed. Open-source approach requires assembling detect+inpaint pipeline.
  • Color jitter augmentation destroys skin retouch training - skin tone consistency is critical for before/after pairs.
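The first gotcha (DINOv3 patches too coarse for 2-5px blemishes) is handled by tiling the photo into overlapping high-res crops before detection. A sketch; the tile and overlap sizes are illustrative:

```python
import numpy as np

def tile_crops(image: np.ndarray, tile: int = 512, overlap: int = 64):
    """Yield (y, x, crop) tiles; overlap keeps defects whole at tile seams.
    Edge tiles are clipped to the image, so the last row/column may be smaller."""
    H, W = image.shape[:2]
    step = tile - overlap
    for y in range(0, max(H - overlap, 1), step):
        for x in range(0, max(W - overlap, 1), step):
            yield y, x, image[y:y+tile, x:x+tile]

tiles = list(tile_crops(np.zeros((1024, 1024)), tile=512, overlap=64))
print(len(tiles))  # 9: a 3x3 grid with 64px overlap
```

Masks detected per tile are shifted back by (y, x) and OR-merged; the overlap guarantees a blemish split by one tile boundary is fully inside a neighboring tile.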

See Also