In-Context Segmentation¶
In-context segmentation (ICS), also called few-shot segmentation, segments images by example: provide one or more (image, mask) pairs and the model segments the same category in new images, without fine-tuning or retraining.
Key Facts¶
- Training-free ICS uses frozen vision backbone features to match semantics between reference and target
- DINOv3 dense features cluster into semantically coherent regions without any supervision (emergent from scale)
- RoPE positional encoding in DINOv3 creates stronger positional bias than DINOv2's learned embeddings - must be debiased for semantic correspondence
- PCA-based debiasing: project features onto the orthogonal complement of the positional subspace to separate position from semantics
- Multi-domain segmentation from a single model: natural images, medical (dermoscopy, lung), underwater, aerial, satellite
- INSID3 achieves 3.31 FPS vs. GF-SAM 0.97 FPS vs. Matcher 0.11 FPS - the training-free approach is also faster
INSID3 (Training-Free, DINOv3)¶
Paper: arXiv:2603.28480 | Code: github.com/visinf/INSID3 | Venue: CVPR 2026 | License: Apache-2.0
Architecture¶
Reference (image + mask)
↓
DINOv3 frozen backbone (ViT-S/B/L)
↓
Positional Bias Debiasing (PCA projection)
↓ semantic features, position-free
Semantic matching with target features
↓
Clustering → binary mask
↓
(Optional) CRF refinement for sharp boundaries
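As a rough illustration of the matching step in the diagram above, the sketch below averages the reference features under the mask into one prototype and thresholds cosine similarity on the target patches. It is a deliberately simplified stand-in, not INSID3's actual matching and clustering; the name `match_by_prototype`, the threshold value, and the toy features are assumptions made for the example.

```python
import numpy as np

def match_by_prototype(ref_feats, ref_mask, tgt_feats, threshold=0.5):
    """Simplified one-shot matching via a single foreground prototype.

    ref_feats: (N_ref, D) reference patch features
    ref_mask:  (N_ref,) boolean, True for patches inside the reference mask
    tgt_feats: (N_tgt, D) target patch features
    Returns:   (N_tgt,) boolean patch-level mask
    """
    # L2-normalize so dot products become cosine similarities
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    # Prototype: mean of the normalized reference features under the mask
    proto = ref[ref_mask].mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    return (tgt @ proto) > threshold

# Toy features: foreground patches point along axis 0, background along axis 1
rng = np.random.default_rng(0)
fg_dir = np.zeros(16)
fg_dir[0] = 1.0
bg_dir = np.zeros(16)
bg_dir[1] = 1.0

ref_feats = np.vstack([
    fg_dir + 0.05 * rng.normal(size=(4, 16)),
    bg_dir + 0.05 * rng.normal(size=(4, 16)),
])
ref_mask = np.array([True] * 4 + [False] * 4)
tgt_feats = np.vstack([
    bg_dir + 0.05 * rng.normal(size=(3, 16)),
    fg_dir + 0.05 * rng.normal(size=(2, 16)),
])

patch_mask = match_by_prototype(ref_feats, ref_mask, tgt_feats)
print(patch_mask)  # last two target patches are marked as foreground
```

A single prototype cannot represent multi-part objects, which is one reason real pipelines cluster features instead of thresholding one similarity map.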
Positional Bias Debiasing¶
DINOv3 uses RoPE (Rotary Position Embedding) instead of learned positional embeddings. Side effect: patch position is encoded more strongly in DINOv3 features than in DINOv2 features. Two semantically identical patches at different image locations get different feature vectors, which breaks cross-image semantic matching.
Fix: PCA-based projection onto the orthogonal complement of the positional subspace.
import torch

def debias_positional(features: torch.Tensor, n_pos_components: int = 8) -> torch.Tensor:
    """
    features: (N, D) - N patch features from DINOv3
    Returns: position-debiased features of same shape
    """
    # Center features so the SVD below performs PCA
    mu = features.mean(dim=0, keepdim=True)
    centered = features - mu
    # PCA: rows of Vh are principal directions, sorted by explained variance.
    # The leading directions are taken as the positional subspace.
    U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
    pos_subspace = Vh[:n_pos_components]  # (n_pos, D) - positional directions
    # Project out positional subspace (orthogonal complement projection)
    P = pos_subspace.T @ pos_subspace  # (D, D) positional projector
    debiased = centered - centered @ P  # remove positional component
    return debiased + mu
No learnable parameters - this is a purely linear operation applied at inference time.
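To make the effect of the projection concrete, here is a small synthetic check written as a NumPy port of the function above. The data is invented for the demonstration: two strong "positional" directions, one weaker "semantic" direction, and a checkerboard of class labels. The helper names (`debias_positional_np`, `positional_energy`, `class_gap`) are hypothetical. It also shows what happens when one component too many is removed.

```python
import numpy as np

def debias_positional_np(features, n_pos_components):
    """NumPy port of the PCA projection: drop the leading principal directions."""
    mu = features.mean(axis=0, keepdims=True)
    centered = features - mu
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    pos = vh[:n_pos_components]  # (k, D) assumed positional directions
    return centered - centered @ pos.T @ pos + mu

rng = np.random.default_rng(0)
grid, dim = 8, 32
rows, cols = np.divmod(np.arange(grid * grid), grid)  # patch coordinates
labels = (rows + cols) % 2  # checkerboard "semantic" classes, uncorrelated with position

# Orthonormal directions for x-position, y-position and semantics (synthetic)
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
u_x, u_y, u_sem = q[:, 0], q[:, 1], q[:, 2]

features = (
    5.0 * np.outer(cols / (grid - 1), u_x)
    + 5.0 * np.outer(rows / (grid - 1), u_y)
    + 1.0 * np.outer(labels, u_sem)
    + 0.01 * rng.normal(size=(grid * grid, dim))
)

def positional_energy(f):
    # Variance of the features along the two planted positional directions
    return np.var(f @ u_x) + np.var(f @ u_y)

def class_gap(f):
    # Distance between the two class means along the planted semantic direction
    proj = f @ u_sem
    return abs(proj[labels == 1].mean() - proj[labels == 0].mean())

debiased = debias_positional_np(features, n_pos_components=2)
over_debiased = debias_positional_np(features, n_pos_components=3)

print("positional energy:", positional_energy(features), "->", positional_energy(debiased))
print("class gap:", class_gap(features), "->", class_gap(debiased))
print("class gap after removing one component too many:", class_gap(over_debiased))
```

In this toy setup the positional variance dominates, so the leading components coincide with position. On real features that assumption has to be checked, which is why the number of removed components matters.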
DINOv3 vs DINOv2 for Dense Tasks¶
| | DINOv2 | DINOv3 |
|---|---|---|
| Teacher params | ~1B | 7B (7x) |
| Training images | 142M | 1.7B (12x) |
| Positional encoding | Learned | RoPE |
| Dense features (ADE20K) | Baseline | +6 mIoU |
| High-res support | ~518px | Stable >4K |
| Gram Anchoring | No | Yes |
Gram Anchoring (new in DINOv3): anchors patch-level feature similarity to an early checkpoint. Prevents the common degeneration where long training improves global features but hurts patch-level semantic density.
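The idea can be sketched as a loss on Gram matrices (pairwise patch similarities): the current model's patch-similarity structure is pulled toward that of the earlier checkpoint. The code below is a minimal NumPy illustration under assumptions: the function names are hypothetical, and the exact normalization and weighting used in DINOv3 training are not reproduced here.

```python
import numpy as np

def gram_matrix(patch_feats):
    """Pairwise cosine similarities between patches: (N, D) -> (N, N)."""
    normed = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    return normed @ normed.T

def gram_anchoring_loss(student_feats, anchor_feats):
    """Squared Frobenius distance between student and anchor Gram matrices."""
    diff = gram_matrix(student_feats) - gram_matrix(anchor_feats)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 8))  # patch features from the early checkpoint (toy)

# A rotated copy keeps every pairwise similarity, so the loss is ~0
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rotated = anchor @ rotation

# Features whose patch-to-patch structure has drifted are penalized
drifted = anchor + 0.5 * rng.normal(size=anchor.shape)

loss_rotated = gram_anchoring_loss(rotated, anchor)
loss_drifted = gram_anchoring_loss(drifted, anchor)
print(loss_rotated, loss_drifted)
```

Because only similarities are constrained, the features themselves remain free to change as long as the relationships between patches are preserved.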
Usage¶
from models import build_insid3  # from official repo

model = build_insid3(backbone='vitl')  # vitl gives best quality

# Single-example segmentation:
model.set_reference("reference.jpg", "reference_mask.png")
model.set_target("target.jpg")
mask = model.segment()  # (H, W) boolean

# Multi-shot (multiple reference examples improve consistency):
model.set_references([
    ("ref1.jpg", "mask1.png"),
    ("ref2.jpg", "mask2.png"),
])
model.set_target("target.jpg")
mask = model.segment()
Performance¶
Benchmarks (One-Shot Segmentation)¶
| Method | Architecture | mIoU | FPS | Params | Fine-tuning |
|---|---|---|---|---|---|
| INSID3 | DINOv3 ViT-L | SOTA +7.5% | 3.31 | 3x fewer | None |
| GF-SAM | DINOv2 + SAM | - | 0.97 | - | None |
| Matcher | DINOv2 + SAM | - | 0.11 | - | None |
| PerSAM | SAM | - | - | - | None (PerSAM-F variant fine-tunes) |
Multi-domain coverage: natural objects, medical imaging (ISIC skin lesions, lung CT), underwater (SUIM), aerial/satellite (iSAID)
Multi-granularity: whole objects, object parts, personalized instances
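For readers less familiar with the metric in the table, the sketch below computes IoU for a binary mask pair and averages it over several pairs. This is a simplified per-image average; benchmark protocols often accumulate intersections and unions per class before averaging, so the numbers are not directly comparable. The function names are hypothetical.

```python
import numpy as np

def binary_iou(pred, gt):
    """Intersection over union of two boolean masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs."""
    return float(np.mean([binary_iou(p, g) for p, g in pairs]))

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True    # 4-pixel square
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True  # same square plus one extra column (6 pixels)

print(binary_iou(pred, gt))                  # 4 / 6
print(mean_iou([(pred, gt), (gt, gt)]))      # (4/6 + 1) / 2
```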
Backbone Ablation¶
Replacing DINOv3 with alternatives degrades results significantly:
- DINOv2: -6 mIoU on dense tasks
- Perception Encoder: below DINOv3
- Stable Diffusion 2.1 features: weakest
The quality gap validates that DINOv3's dense feature quality (not just the pipeline) is the key factor.
Practical Applications¶
Batch Annotation for Datasets¶
from pathlib import Path

import numpy as np
from PIL import Image
from models import build_insid3  # from official repo

def annotate_batch(reference_img: str, reference_mask: str,
                   target_dir: str, output_dir: str):
    """
    Annotate all images in target_dir using a single reference.
    Useful for: LoRA dataset preparation, fine-tuning data collection.
    """
    model = build_insid3()
    model.set_reference(reference_img, reference_mask)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(target_dir).glob("*.jpg")):
        model.set_target(str(img_path))
        mask = model.segment()
        # Save as 8-bit PNG mask (assumes mask is a NumPy-compatible boolean array;
        # a torch tensor would need .cpu().numpy() first)
        mask_out = Path(output_dir) / (img_path.stem + "_mask.png")
        Image.fromarray(np.asarray(mask).astype("uint8") * 255).save(mask_out)
Part Segmentation Pipeline¶
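# Segment specific body parts or object components
# by providing a reference showing only that part's mask
import numpy as np
from PIL import Image
from models import build_insid3  # from official repo

def save_mask(mask, path):
    # Minimal helper (not part of INSID3): write a boolean mask as an 8-bit PNG
    Image.fromarray(np.asarray(mask).astype("uint8") * 255).save(path)

model = build_insid3()
parts = {
    "hair": ("ref_full.jpg", "ref_hair_mask.png"),
    "face": ("ref_full.jpg", "ref_face_mask.png"),
    "hands": ("ref_full.jpg", "ref_hands_mask.png"),
}
for part_name, (ref_img, ref_mask) in parts.items():
    model.set_reference(ref_img, ref_mask)
    model.set_target("target.jpg")
    part_mask = model.segment()
    save_mask(part_mask, f"target_{part_name}_mask.png")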
Quality Control¶
import numpy as np
from scipy import ndimage

def validate_mask_quality(mask, min_area_ratio=0.001, max_area_ratio=0.9):
    """Basic sanity checks for auto-generated masks."""
    mask = np.asarray(mask, dtype=bool)
    area_ratio = mask.mean()  # fraction of pixels marked as foreground
    if area_ratio < min_area_ratio:
        return "too_small"  # likely missed the object
    if area_ratio > max_area_ratio:
        return "too_large"  # likely leaked to background
    # Check connectivity (fragmented masks often indicate failure):
    labeled, n_components = ndimage.label(mask)
    if n_components > 10:
        return "fragmented"  # noisy segmentation
    return "ok"
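For masks that come back fragmented, one common cleanup is to keep only the largest connected component before sending the rest to human review. The sketch below is a generic post-processing step, not something INSID3 provides; `keep_largest_component` is a hypothetical helper, and it is only appropriate when a single object is expected per image.

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask):
    """Return a mask containing only the largest connected foreground region."""
    mask = np.asarray(mask, dtype=bool)
    labeled, n_components = ndimage.label(mask)
    if n_components <= 1:
        return mask  # empty or already a single region
    sizes = np.bincount(labeled.ravel())  # index 0 counts background pixels
    sizes[0] = 0                          # never select the background
    return labeled == sizes.argmax()

mask = np.zeros((6, 6), dtype=bool)
mask[0:3, 0:3] = True  # main object, 9 pixels
mask[5, 5] = True      # isolated speck

cleaned = keep_largest_component(mask)
print(int(mask.sum()), "->", int(cleaned.sum()))
```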
When to Use INSID3 vs SAM¶
| Scenario | INSID3 | SAM (click-based) |
|---|---|---|
| Batch annotation, same category | Best | Slow (needs clicks per image) |
| Novel/rare category | Good | N/A (requires prompt engineering) |
| Interactive click-based | Slow (3.31 FPS) | Better (<20ms per click) |
| Soft boundaries (hair, fur) | Needs CRF, human review | Also struggles |
| Medical imaging domain | Validated (ISIC, lung) | Often needs fine-tuning |
| Non-English labels | Works (vision-only) | Works |
Setup¶
git clone https://github.com/visinf/INSID3
cd INSID3
# DINOv3 backbone weights (from Meta FAIR):
# https://github.com/facebookresearch/dinov3
# Download ViT-L weights for best quality
pip install -r requirements.txt
# PyTorch >= 2.0, Python 3.10, CUDA 12.6
Gotchas¶
- Reference selection is critical - INSID3 transfers the visual semantics of YOUR reference mask, not a category name. A poorly cropped or ambiguous reference mask will produce poor results. Use tight, unambiguous reference masks with diverse visual examples of the target category
- 3.31 FPS means batch, not real-time - at ~300ms per image, INSID3 is not suitable for interactive annotation or video. Use it for offline batch processing of large datasets
- Soft/ambiguous boundaries need human review - masks for hair, fur, transparent objects, and motion-blurred edges are systematically less accurate. CRF post-processing helps but doesn't solve the fundamental ambiguity; plan for human-in-the-loop on these categories
- Positional debiasing requires calibration - the number of PCA components to remove (n_pos_components) may need tuning per domain; too few leaves residual position encoding, too many removes semantic signal along with position
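To put the throughput figures into planning terms, the arithmetic below converts the reported FPS values into per-image latency and wall-clock time for a batch. It assumes sequential processing on hardware comparable to the reported setup and ignores image loading and mask writing; the batch size of 10,000 is an arbitrary example.

```python
# Reported one-shot throughput in frames per second (from the figures above)
FPS = {"INSID3": 3.31, "GF-SAM": 0.97, "Matcher": 0.11}

def batch_hours(n_images, fps):
    """Wall-clock hours to process n_images sequentially at the given FPS."""
    return n_images / fps / 3600.0

per_image_ms = 1000.0 / FPS["INSID3"]  # about 302 ms per image
print(f"INSID3 latency: {per_image_ms:.0f} ms per image")

for name, fps in FPS.items():
    print(f"{name}: {batch_hours(10_000, fps):.1f} h for 10,000 images")
```

At these rates a 10,000-image batch takes under an hour with INSID3 and roughly a day with Matcher, which is the practical meaning of "batch, not real-time".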
See Also¶
- [[image-generation/SANA-Denoiser Architecture]] - related dense feature architectures
- [[data-science/cnn-computer-vision]] - CNN backbones used in segmentation
- [[data-science/attention-mechanisms]] - ViT attention used in DINOv3