Tiled Inference for High-Resolution Processing¶
Techniques for processing images larger than model input size by splitting into overlapping tiles, processing individually, and stitching back. Critical for production photography where originals are 5000-15000px.
Core Problem¶
Most models accept 512-1024px max. Production photography needs pixel-perfect results at 5000-15000px. Naive splitting → visible seams at tile boundaries, especially on: - Smooth gradients (skin, sky, metal surfaces) - Repeating patterns (textures, fabrics) - Fine details crossing tile boundaries (jewelry edges, hair)
SAHI (Slicing Aided Hyper Inference)¶
Standard tiled inference library for detection models (YOLO, etc.):
from sahi.predict import get_sliced_prediction
result = get_sliced_prediction(image, model, slice_height=640, slice_width=640, overlap_height_ratio=0.2, overlap_width_ratio=0.2)
Works for detection (merges bounding boxes across tiles). For generation/editing tasks, stitching is more complex.
Overlap-Based Stitching for Generation¶
From chat discussions (Vladimir's approach):
1. Take tiles with overlap (e.g., input 1000×1000, keep center 800×800)
2. For next tile, include edges from already-processed tiles as overlap
3. Model sees context from neighboring processed tiles
4. Crop to center, avoiding boundary artifacts
Gradient-Aware Blending¶
For smooth gradients (jewelry metal surfaces): - Linear blend in overlap zone: output = alpha * tile_A + (1-alpha) * tile_B - Poisson blending at seams (laplacian-based, preserves gradients) - Latent-space stitching (as in X Dub): average latents at boundaries, then decode
FLUX Kontext Diff-Merge Approach¶
flux-kontext-diff-merge detects changed regions in LAB color space and selectively merges only the diff back into original using Poisson blending. Applicable to tiled processing — process tile, merge only changed pixels.
Known Issues (from team experience)¶
- Gradient discontinuity — smooth backgrounds show tile boundaries after processing. Mitigation: larger overlap, gradient-aware blending
- Detail loss at boundaries — fine elements (gemstone edges, chain links) crossing tiles get corrupted. Mitigation: tile grid aligned to segmentation masks, process objects wholly
- Color drift — cumulative color shift across many tiles. Mitigation: per-tile color correction against reference (as in X Dub's sliding window)
- Segmentation across tiles — models like SAM/ZIM eat max ~1200px per side; ideal edges need tiled segmentation with post-merge. Preliminary full-image seg at lower res → refine edges with high-res tiles
Latent-Space Tiling (for diffusion models)¶
Alternative: tile in latent space (after VAE encode, before denoising):
Full image → VAE encode → latent (H/8 × W/8 × C)
Tile latents → denoise each tile with overlap
Stitch latents → VAE decode full
Advantage: VAE decode of full stitched latent produces globally coherent output. This is how X Dub handles long video.
Solar Curves (Edge Visualization)¶
Non-standard technique from retouching: apply solar curve (tone inversion at midtones) to reveal subtle edges invisible at normal viewing. Can be used as auxiliary input for segmentation models to improve edge detection on white-on-white or low-contrast boundaries.
# Solar curve: invert midtones to reveal hidden edges
# See solar-curves-saver.py, solar-1-effect.py, solar-2-effect.py (team scripts)
Discussed for: white jewelry on white background segmentation, metal edge detection, gradient quality verification after tiled processing.
VRAM-Aware Tile Sizing¶
For low-VRAM GPUs, tile size must be calculated from available memory. See low vram inference strategies for detailed adaptive tile selection.
Quick reference (FP16 U-Net restoration model, 20 MB weights):
| Tile Size | Total VRAM | Fits 2 GB GPU |
|---|---|---|
| 128x128 | ~60 MB | Yes |
| 256x256 | ~140 MB | Yes |
| 512x512 | ~450 MB | Yes |
| 1024x1024 | ~1.6 GB | Tight |
BatchNorm warning: tiling with BatchNorm causes artifacts that overlap cannot fix (statistics computed per-tile, not per-image). Use LayerNorm/InstanceNorm, or train on tiles matching inference size.
Global Context Conditioning¶
Standard tiling loses global context per-tile. Methods to inject scene-wide information:
CMod (Channel-wise Modulation)¶
64-dimensional global context vector derived from the full image (downsampled or separately encoded). Injected at each tile via channel-wise multiplication in the feature space.
# Global vector: encode downsampled image → 64d vector
global_ctx = encoder_small(image_resized) # [B, 64]
# Inject per tile via FiLM-style modulation
for tile_features in tile_feature_list:
scale, shift = proj_layer(global_ctx).chunk(2, dim=-1)
tile_features = tile_features * (1 + scale) + shift
Advantage over cross-attention: 64d vector is extremely lightweight. Constant overhead regardless of image size.
FiLM (Feature-wise Linear Modulation)¶
Learnable affine transform conditioned on global context at each normalization layer. More expressive than CMod, compatible with LayerNorm-based architectures.
Bottleneck Attention (DehazeXL)¶
Tokenize full image at low resolution → encode to bottleneck → inject via cross-attention into each tile's processing. Tile attends to global semantic tokens while doing local processing.
Full image (downsampled) → Tokenize → Encoder → [bottleneck tokens]
Each tile processing:
Tile tokens (local) + bottleneck tokens (global) → cross-attention → tile output
Taxonomy of Global Context Methods¶
| Method | Mechanism | Overhead | Best For |
|---|---|---|---|
| CMod | 64d channel multiply | Minimal | Color/lighting consistency |
| FiLM | Affine at each norm layer | Low | Style consistency |
| Bottleneck attn (DehazeXL) | Cross-attn to global tokens | Medium | Semantic consistency |
| ConvGRU | Recurrent scanning | Medium | Sequential coherence |
| Mamba | State-space scanning | Medium | Long-range dependencies |
| Skip-residual (DemoFusion) | Add global image downsampled | Low | Structure consistency |
Positional Encoding for Tiles¶
| Method | Global Awareness | Cost |
|---|---|---|
| None (no PosEnc) | Tile-local only | Zero |
| Learned tile position embedding | Knows grid position | Minimal |
| RoPE on full-image coords | Continuous position | Low |
| Spectral encoding of location | Frequency-based position | Low |
2025 Papers: Artifact Reduction¶
STA (Sliding Tile Attention)¶
Tiles the attention computation itself (not just inference). 91% sparsity in attention mask, 3.53× speedup on long sequences. Applied to FLUX for high-res generation. Not inference-time only - built into the model's attention mechanism.
FreeScale¶
Scale Fusion: separates self-attention into global (low-frequency structure) and local (high-frequency detail) branches. Local branch processes individual tiles, global branch processes downsampled full image. Merges at inference time without fine-tuning.
# Conceptual:
global_attn = self_attention(downsample(full_image)) # structure
local_attn = self_attention(tile) # detail
output = merge(global_attn, local_attn)
ScaleDiff (NPA + LFM + SDEdit)¶
Three components: - NPA (Non-uniform Pixel Aggregation): weighted tile blending based on confidence scores instead of uniform averaging - LFM (Local Feature Matching): aligns features across tile boundaries before blending - SDEdit integration: runs partial-noise SDEdit on boundary zones to smooth remaining artifacts
APT (Adaptive Pattern Transfer)¶
Statistical matching at tile boundaries:
# For each overlapping boundary region:
mu_left, sigma_left = target_region.mean(), target_region.std()
mu_right, sigma_right = neighbor_region.mean(), neighbor_region.std()
# Transfer statistics: align neighbor to match target's distribution
neighbor_normalized = (neighbor_region - mu_right) / sigma_right
neighbor_aligned = neighbor_normalized * sigma_left + mu_left
No learning required. Works as post-processing on any tiled output.
AccDiffusion v2 (Patch-Content-Aware Prompting)¶
Generates per-tile prompts by querying a VLM about the tile content. Prevents object duplication: if left tile has "mountain", right tile prompt explicitly excludes "mountain" or specifies "continuation of mountain range."
tile_caption = vlm_describe(tile_crop_from_low_res_guide)
unique_tile_prompt = base_prompt + " " + tile_caption
SANA Compatibility Matrix¶
| Method | SANA Compatible | Notes |
|---|---|---|
| SyncDiffusion | No | UNet-based, incompatible |
| MultiDiffusion | Partial | Works but ignores linear attention structure |
| DemoFusion | No | UNet-specific skip connections |
| FreeScale | Yes | Architecture-agnostic |
| APT | Yes | Post-processing, model-agnostic |
| AccDiffusion v2 | Yes | Prompt-level, model-agnostic |
| DC-AE native tiling | Yes | pipe.vae.enable_tiling(...) |
Frequency-Aware Tiling (2024-2026)¶
Standard tiling loses high-frequency details (fine texture, fabric weave, hair) at tile boundaries. Frequency-domain methods preserve these while enabling global coherence.
HiWave (SIGGRAPH Asia 2025)¶
Paper: arxiv 2506.20452
Training-free. Patch-wise DDIM inversion + wavelet-based detail enhancer: 1. Retain low-frequency structure from the base image (global consistency) 2. Selectively guide high-frequency components for fine detail enhancement
Preferred in >80% of user comparisons vs prior SOTA. From Disney Research.
Frequency-Aware Guidance (ECCV 2024 Workshop)¶
Paper: arxiv 2411.12450
Plug-in DWT-based loss enforcing consistency in both spatial and frequency domains simultaneously:
import torch, pywt
def freq_aware_loss(pred, target, wavelet='haar', level=2):
# Decompose both into wavelet subbands
pred_coeffs = pywt.wavedec2(pred.cpu().numpy(), wavelet, level=level)
tgt_coeffs = pywt.wavedec2(target.cpu().numpy(), wavelet, level=level)
# Match low-frequency structure + high-frequency details
loss_lf = F.mse_loss(pred_coeffs[0], tgt_coeffs[0])
loss_hf = sum(F.mse_loss(p, t) for p, t in zip(pred_coeffs[1:], tgt_coeffs[1:]))
return loss_lf + 0.1 * loss_hf
Drop-in addition to any diffusion pipeline. +3.72 dB PSNR on blind deblurring.
Latent Wavelet Diffusion (LWD, 2025)¶
Paper: arxiv 2506.00433
Three-component framework enabling any LDM to scale to 2K-4K without architectural changes: 1. Scale-consistent VAE objective — spectral fidelity across resolution changes 2. Wavelet energy maps — identify detail-rich regions (skin, hair, fabric) where denoising must be strongest 3. Time-dependent masking — focus denoising budget on high-frequency components at early timesteps
# Wavelet energy map identifies where texture preservation matters:
coeffs = pywt.wavedec2(image, 'db4', level=3)
energy_map = sum(np.abs(c)**2 for c in coeffs[1:]) # HF energy per pixel
W-Edit (Oct 2025)¶
Training-free wavelet-based frequency modulation for text-driven editing. Decomposes diffusion features into multi-scale wavelet bands, separating structural anchors (preserve) from editable details (modify). Selective injection into pretrained DiT/UNet layers.
Face Retouching with Spectral Restoration (ICCV 2025)¶
Directly applicable to face retouching pipelines:
- Frequency Selection and Restoration (FSR): frequency-domain quantization with spatial projections
- Multi-Resolution Fusion (MRF): Laplacian pyramid fusion — removes blemishes while preserving fine details
Approach: separate low-frequency (skin tone, broad shading) from high-frequency (pores, fine lines) at each pyramid level. Edit only the relevant level for the specific retouching task.
LF0 (coarse, global) → lighting/color correction
HF1 (medium) → blemish/spot removal
HF2 (fine) → preserve skin texture
HF3 (finest) → preserve hair / eyelash detail
RefineAnything (2026)¶
GitHub: github.com/limuloo/RefineAnything
HuggingFace: limuloo1999/RefineAnything
Multimodal diffusion model for region-specific refinement. Fixes distorted text, logos, fine structures within a specified region while keeping the background intact. Supports editing with and without a reference image.
For tiled pipelines: - Post-assembly seam refinement: run on boundary zones between tiles to fix stitching artifacts - Uses "reverse diffusion detailing" pattern — diffusion forward (denoise) then backward (add detail) - Feeding a thumbnail first means fewer denoising steps needed
APT: Adaptive Path Tracing (Jul 2025)¶
Paper: arxiv 2507.21690
Identifies two failure modes in patch-based methods: 1. Patch-level distribution shift — different patches develop different color/brightness statistics 2. Increased patch monotonicity — local context causes repetitive detail within a patch
Fixes via Statistical Matching (align patch distributions) + Scale-aware Scheduling (global coherence injection).
Frequency Method Comparison¶
| Method | Training Required | Target Problem | Testability |
|---|---|---|---|
| HiWave | No (DDIM inversion) | Patch detail enhancement | 3-4 days |
| Freq-Aware Guidance | No (plug-in loss) | Texture preservation at boundaries | 2-3 days |
| Latent Wavelet Diffusion | No (framework) | 2K-4K upscaling with texture | 3-4 days |
| FSR (face retouching) | Code unclear | Face blemish removal + detail preserve | 5-7 days |
| RefineAnything | No (pretrained) | Region-specific artifact fixing | 1-2 days |
See Also¶
- low vram inference strategies - adaptive tile sizing, memory management
- temporal tiling - cross-tile context propagation
- diffusion inference acceleration - complementary acceleration techniques
- SANA - 32x compression reduces tiling need
- intrinsic decomposition - frequency-independent color correction for tiles