Image Similarity Pipeline: Graph Supervision, Contrastive Training, Vector Search

★★★★★ Intermediate

Building an image retrieval system where "similar" is defined by weak graph supervision (co-engagement edges, e.g., Pinterest-style pin_related). Covers noise filtering, backbone selection, contrastive training recipes, and serving via vector search.

Reference scale: 430K images, 7.2M graph edges, CLIP+CSD+color in Elasticsearch.

Part 1: Cleaning Graph-Supervised Training Data

Nature of the Noise

Co-engagement graphs (co-save, co-click) generate noisy positives:

L1 edges (same board, same session) - high-quality, ~90% true positives
L2+ edges (neighbour-of-neighbour) - 10-30% noise, mostly topic drift rather than random garbage
False negatives are massive: visually identical items from different boards have no edge

At 7.2M edges over 430K pins (~16.7 edges per pin), published benchmarks suggest 10-30% label noise.

Cluster-Coherence Filtering (no labels needed)

Best ROI first pass. Requires embedding vectors per image and k-means cluster assignments.

import numpy as np
from scipy import stats
from sklearn.metrics.pairwise import cosine_similarity

def cluster_coherence_filter(edges, embeddings, cluster_ids, z_threshold_drop=-3, z_threshold_flag=-2):
    """
    edges: list of (src_id, dst_id)
    embeddings: {id: np.array} - CSD or CLIP vectors
    cluster_ids: {id: int}
    Returns: (confirmed, flagged, dropped), each a list of (src, dst, sim, z)
    """
    # Group edges by the source pin's cluster so z-scores are relative to that cluster
    cluster_edges = {}  # cluster_id -> list of (src, dst, similarity)
    for src, dst in edges:
        sim = cosine_similarity([embeddings[src]], [embeddings[dst]])[0][0]
        cid = cluster_ids[src]
        cluster_edges.setdefault(cid, []).append((src, dst, sim))

    confirmed, flagged, dropped = [], [], []
    for cid, edge_list in cluster_edges.items():
        sims = np.array([e[2] for e in edge_list])
        if len(sims) < 10 or sims.std() == 0:
            # Too few edges (or no spread) for a meaningful z-score: keep all, z = 0
            confirmed.extend((src, dst, sim, 0.0) for src, dst, sim in edge_list)
            continue
        z_scores = stats.zscore(sims)
        for (src, dst, sim), z in zip(edge_list, z_scores):
            if z < z_threshold_drop:
                dropped.append((src, dst, sim, z))
            elif z < z_threshold_flag:
                flagged.append((src, dst, sim, z))
            else:
                confirmed.append((src, dst, sim, z))

    return confirmed, flagged, dropped

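Expected output: drops 8-15% of edges as obvious noise and flags 5-10% for review. These shares depend on the thresholds: with the defaults (-3 / -2), Cantelli's inequality caps drops at 10% and drops plus flags at 20% within any cluster, so the upper end of those ranges needs looser thresholds.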

Do NOT auto-drop L1 edges (same board, same session) - user explicitly grouped these even if CSD distance looks wrong.
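
One way to enforce this rule is to post-process the filter output. The sketch below assumes a hypothetical `edge_level` lookup from `(src, dst)` to hop level (1 for L1), which this pipeline does not define anywhere; dropped L1 edges are moved to the review queue instead of being discarded.

```python
def protect_l1_edges(flagged, dropped, edge_level):
    """Move dropped L1 edges into the flagged list so they are reviewed, not deleted.

    flagged, dropped: lists of (src, dst, sim, z) as returned by the coherence filter
    edge_level: {(src, dst): int}, hypothetical lookup; missing edges count as L2+
    """
    rescued, kept_dropped = [], []
    for edge in dropped:
        src, dst = edge[0], edge[1]
        if edge_level.get((src, dst), 2) == 1:
            rescued.append(edge)       # user-grouped edge: send to review
        else:
            kept_dropped.append(edge)  # L2+ edge: safe to drop automatically
    return flagged + rescued, kept_dropped


flagged = [("p1", "p2", 0.41, -2.3)]
dropped = [("p1", "p3", 0.10, -3.4), ("p4", "p5", 0.05, -3.9)]
levels = {("p1", "p3"): 1, ("p4", "p5"): 2}
new_flagged, new_dropped = protect_l1_edges(flagged, dropped, levels)
print(len(new_flagged), len(new_dropped))  # 2 1
```

Routing rescued edges to the flagged list keeps them visible to a human reviewer rather than silently re-confirming them.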

NN-Graph Cross-Reference

def nn_graph_partition(pin_ids, edges, es_client, k=50):
    """
    Partition edges into: confirmed (in both pin_related AND CSD kNN),
    drop_candidate (in pin_related but NOT in CSD kNN),
    add_candidate (in CSD kNN but NOT in pin_related)
    Typical ratio: 60% confirmed, 25% drop-candidate, 15% add-candidate

    es_client.knn_search(pin_id, k, field) is assumed to be a project-specific
    wrapper that returns hits shaped like {'_id': ...}.
    """
    # Index outgoing edges per source pin once, instead of rescanning all edges per pin
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)

    confirmed, drop_candidate, add_candidate = [], [], []
    for pin_id in pin_ids:
        # kNN query to ES for this pin's CSD vector
        knn_results = es_client.knn_search(pin_id, k=k, field="csd_vec")
        knn_set = set(r['_id'] for r in knn_results) - {pin_id}  # ignore the self-match
        graph_set = neighbors.get(pin_id, set())

        confirmed.extend((pin_id, n) for n in knn_set & graph_set)
        add_candidate.extend((pin_id, n) for n in knn_set - graph_set)
        drop_candidate.extend((pin_id, n) for n in graph_set - knn_set)

    return confirmed, drop_candidate, add_candidate
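
For offline checks on a small sample it can help to run the partition without Elasticsearch. The class below is a hypothetical in-memory stand-in, not part of any ES client: it exposes the same `knn_search(pin_id, k, field)` shape the function above expects and does exact cosine kNN with NumPy.

```python
import numpy as np

class BruteForceKnn:
    """Hypothetical in-memory stand-in exposing knn_search(pin_id, k, field)."""

    def __init__(self, vectors_by_field):
        # vectors_by_field: {field_name: {pin_id: 1-D np.array}}
        self.index = {}
        for field, vecs in vectors_by_field.items():
            ids = list(vecs)
            mat = np.stack([np.asarray(vecs[i], dtype=float) for i in ids])
            mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit rows: dot = cosine
            self.index[field] = (ids, mat)

    def knn_search(self, pin_id, k=50, field="csd_vec"):
        ids, mat = self.index[field]
        sims = mat @ mat[ids.index(pin_id)]
        order = np.argsort(-sims)  # most similar first
        hits = [{"_id": ids[j], "_score": float(sims[j])}
                for j in order if ids[j] != pin_id]  # skip the query pin itself
        return hits[:k]


vecs = {"a": np.array([1.0, 0.0]), "b": np.array([0.9, 0.1]), "c": np.array([0.0, 1.0])}
client = BruteForceKnn({"csd_vec": vecs})
top = client.knn_search("a", k=2)
print([h["_id"] for h in top])  # ['b', 'c']
```

Exact search is quadratic in the number of pins, so this is only practical for spot checks on a few thousand vectors.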

Active Learning: Signal Disagreement Ranking

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def signal_disagreement_rank(edges, clip_vecs, csd_vecs, color_jaccard_fn):
    """Rank edges by variance across similarity signals.
    High variance = the signals disagree = most informative for labeling.
    Target 2-5K labels from the top of this ranking.
    """
    scores = []
    for src, dst in edges:
        csd_sim = cosine_similarity([csd_vecs[src]], [csd_vecs[dst]])[0][0]
        clip_sim = cosine_similarity([clip_vecs[src]], [clip_vecs[dst]])[0][0]
        color_sim = color_jaccard_fn(src, dst)
        # Raw signals have different operating ranges (see Gotchas); z-score each
        # signal first if one of them dominates the variance
        variance = np.var([csd_sim, clip_sim, color_sim])
        scores.append((src, dst, variance, csd_sim, clip_sim, color_sim))

    return sorted(scores, key=lambda x: -x[2])  # highest variance first

Expected lift from 2-5K active labels: 3-8 points Recall@10 vs random labeling.

Labeling throughput guide:
- Streamlit/Tkinter side-by-side binary (A/D/S keys): ~800-1200 pairs/hour
- Label Studio form UI: ~200-400 pairs/hour

Output Artifacts

Save for reproducibility and model eval reuse:

edge_audit.parquet     # src, dst, csd_sim, clip_sim, color_jaccard, z_score, nn_confirmed, verdict
noise_classifier.pkl   # trained on 2K human labels, auto-labels remaining flagged edges
human_labels.sqlite    # 2K reviewed pairs, reuse in test set construction
edges_clean.parquet    # final training edge list

Part 2: Model Architecture

Backbone Signals

Backbone	Captures	Dim	Notes
CLIP ViT-L/14	Semantic content, concepts	768	Swap for SigLIP 2 ViT-L for 2-5 point lift
CSD (Somepalli 2024)	Style attributes, color palettes	768	SOTA open-source style descriptor
DINOv3 (Meta 2025)	Instance geometry, dense patches	1024	+10.9 GAP on instance retrieval vs DINOv2
Color features	Brightness, saturation, temperature	~24	Weak signal, use as tie-breaker only

SigLIP 2 (arXiv 2502.14786): drop-in replacement for CLIP, trained with a sigmoid loss; multilingual, with better retrieval. ViT-L is the sweet spot for corpora under 1M images.

DINOv3 (arxiv 2508.10104): frozen features are near-SOTA on retrieval without fine-tuning. Add as fourth backbone.

Recommended Architecture: Frozen Multi-Backbone + Projection Head

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityProjectionHead(nn.Module):
    def __init__(self, input_dim=1560, hidden_dim=1024, output_dim=256):
        super().__init__()
        # input_dim = CLIP(768) + CSD(768) + color(24) = 1560
        # Add DINOv3(1024) -> input_dim = 2584
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 512),
            nn.GELU(),
            nn.Linear(512, output_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


def infonce_loss(q, k_pos, k_neg, temperature=0.07):
    """
    q: [B, D], k_pos: [B, D], k_neg: [B, K, D]
    All L2-normalized.
    """
    logits_pos = (q * k_pos).sum(-1, keepdim=True) / temperature
    logits_neg = torch.einsum('bd,bkd->bk', q, k_neg) / temperature
    logits = torch.cat([logits_pos, logits_neg], dim=1)  # [B, 1+K]
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
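
A framework-free reference can be useful for unit-testing the loss above. The NumPy function below is a sketch that mirrors the same computation (positive logit in column 0, cross-entropy against label 0); the name `infonce_reference` and the two test cases are illustrative, not part of the original recipe.

```python
import numpy as np

def infonce_reference(q, k_pos, k_neg, temperature=0.07):
    """NumPy reference: q [B, D], k_pos [B, D], k_neg [B, K, D], all L2-normalized."""
    pos = np.sum(q * k_pos, axis=-1, keepdims=True) / temperature  # [B, 1]
    neg = np.einsum("bd,bkd->bk", q, k_neg) / temperature          # [B, K]
    logits = np.concatenate([pos, neg], axis=1)                    # [B, 1+K]
    logits = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())


# Case 1: every key equals the query, so all 1+K logits tie and loss = log(1+K)
q = np.tile(np.array([[1.0, 0.0]]), (3, 1))
tied = infonce_reference(q, q, np.repeat(q[:, None, :], 4, axis=1))

# Case 2: positives match the query, negatives are orthogonal, so loss is near 0
orth = np.tile(np.array([0.0, 1.0]), (3, 4, 1))
easy = infonce_reference(q, q, orth)
print(round(tied, 4), easy < 1e-4)  # 1.6094 True
```

The tied case gives the chance-level loss for K negatives, a handy baseline when reading early training curves.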

Training config (Recipe A - 2-3 days, single GPU):

# Precompute backbone features first (avoids re-running the backbones every epoch)
# Load: clip_vecs[pin_id], csd_vecs[pin_id], dinov3_vecs[pin_id], color_vecs[pin_id]

NUM_EPOCHS = 3  # 3-5
model = SimilarityProjectionHead(input_dim=1560)  # 2584 with DINOv3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# T_max must match the number of scheduler.step() calls (one per epoch here);
# otherwise the cosine schedule starts rising again after T_max steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

# Batch: (pin_a_features, pin_b_features) from cleaned edges
# Hard negatives: K=4 samples from same cluster as pin_a, not in edges
# Effective batch: 1024+ (bigger = more in-batch negatives = better)
# H100 training time: 2-4 hours
# Expected Recall@10 lift: +5-10 points over weighted-kNN baseline
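
To see how the cosine schedule interacts with the epoch count, the closed form of cosine annealing can be evaluated directly. This is a sketch of the standard formula, assuming one scheduler step per epoch and `eta_min=0`; the helper name `cosine_annealing_lr` is made up for illustration.

```python
import math

def cosine_annealing_lr(step, base_lr=1e-4, t_max=3, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `step` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


# Compare a schedule that matches the epoch count with one that is too short
for t_max, epochs in [(3, 3), (3, 5), (5, 5)]:
    lrs = [cosine_annealing_lr(e, t_max=t_max) for e in range(epochs)]
    print(f"T_max={t_max}, epochs={epochs}:", [f"{lr:.2e}" for lr in lrs])
```

With `t_max=3` and five epochs, the rate reaches zero for the fourth epoch and climbs back up in the fifth, which is why the schedule length should match the planned number of steps.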

Recipe B - LoRA fine-tune SigLIP 2 (adds 3-5 more points):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
siglip_model = get_peft_model(siglip_model, lora_config)
# Same InfoNCE loss; only the LoRA adapter weights train, the base encoder weights stay frozen
# Training time: 6-10h H100, cost ~$15

Hard Negative Mining

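def sample_hard_negatives(batch_ids, cluster_ids, edge_set, all_ids_by_cluster,
                          embeddings, K=4, margin=0.7, rng=None):
    """
    For each pin, sample K negatives from same cluster that:
    1. Are NOT linked to the pin in edge_set (checked in both directions)
    2. Have cosine similarity < margin (avoid likely false negatives)
    embeddings: {id: L2-normalized np.array}, so the dot product is the cosine
    """
    rng = rng or np.random.default_rng()
    hard_negs = []
    for pin_id in batch_ids:
        cid = cluster_ids[pin_id]
        candidates = [
            x for x in all_ids_by_cluster[cid]
            if x != pin_id
            and (pin_id, x) not in edge_set and (x, pin_id) not in edge_set
            and float(embeddings[pin_id] @ embeddings[x]) < margin
        ]
        if not candidates:
            raise ValueError(f"no valid hard negatives for pin {pin_id}")
        # Fall back to sampling with replacement when fewer than K candidates remain
        hard_negs.append(rng.choice(candidates, K, replace=len(candidates) < K))
    return np.array(hard_negs)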

False negative trap: ~70% of top-similar pairs in your corpus ARE related but missing from the graph. Apply a similarity margin threshold (<0.7 cosine) to candidates before treating them as negatives.

Cost/Lift Table

Approach	Training time	GPU cost	Expected Recall@10 lift
Weighted kNN (baseline)	0	$0	baseline
Tune weights on labeled data	1h	$0	+3-7 pts (free win)
Add DINOv3 backbone	1 day	$5	+3-7 pts
Recipe A: MLP + hard negatives	4-8h H100	$5-10	+8-14 pts
Recipe B: LoRA SigLIP 2	10h H100	$15	+12-18 pts total
GraphSAGE-lite	20h H100	$30	+15-22 pts (if multi-hop signal)

Part 3: Evaluation

Building the Test Set (do first, before any training)

# Sample 200 query pins, stratified by cluster
# For each query: take top 50 candidates from current baseline kNN
# Human-label each (relevant=1, borderline=0.5, irrelevant=0)
# Total: 10,000 judgements, ~10-12h work, save to test_pairs.sqlite
# NEVER include test pins in training edges
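
The sampling and hold-out steps above can be sketched as follows. "Stratified" is taken here to mean round-robin over clusters (equal allocation rather than proportional), which is an assumption; both function names are illustrative.

```python
import random
from collections import defaultdict

def sample_queries_stratified(cluster_ids, n_queries=200, seed=0):
    """Pick query pins round-robin across clusters so small clusters are represented."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pin_id, cid in sorted(cluster_ids.items()):
        by_cluster[cid].append(pin_id)
    # Shuffle each cluster once, then take one pin per cluster per round
    pools = [rng.sample(pins, len(pins)) for _, pins in sorted(by_cluster.items())]
    queries = []
    while len(queries) < n_queries and any(pools):
        for pool in pools:
            if pool and len(queries) < n_queries:
                queries.append(pool.pop())
    return queries


def remove_test_pins(edges, test_pins):
    """Drop every training edge that touches a held-out pin on either side."""
    held_out = set(test_pins)
    return [(a, b) for a, b in edges if a not in held_out and b not in held_out]


cluster_ids = {f"pin{i}": i % 4 for i in range(40)}
queries = sample_queries_stratified(cluster_ids, n_queries=8)
edges = [("pin0", "pin1"), ("pin2", "pin3"), (queries[0], "pin5")]
train_edges = remove_test_pins(edges, queries)
```

Filtering on both endpoints matters: an edge whose destination is a test pin leaks that pin into training just as much as one where it is the source.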

Metric	Use when
Recall@10	Primary - "show me 10 relevant results"
NDCG@10	Secondary - rewards ordering
Precision@5	"Hero result" quality
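
A minimal sketch of the three metrics over graded judgements (1, 0.5, 0) follows. Several choices here are assumptions rather than definitions from this pipeline: only a label of 1 counts as relevant for recall and precision, recall is measured against the judged pool rather than the whole corpus, and NDCG uses linear gain.

```python
import math

def precision_at_k(labels, k=5, threshold=1.0):
    """labels: graded judgements in ranked order; share of the top k that is relevant."""
    return sum(1 for r in labels[:k] if r >= threshold) / k

def recall_at_k(labels, n_relevant, k=10, threshold=1.0):
    """n_relevant: relevant items judged for this query (the pool, not the corpus)."""
    if n_relevant == 0:
        return 0.0
    return sum(1 for r in labels[:k] if r >= threshold) / n_relevant

def ndcg_at_k(labels, all_labels, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering of all judged items."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0


ranked = [1, 0, 0.5, 1, 0, 0, 1, 0, 0, 0]   # labels of the model's top 10
pool = ranked + [1, 0.5]                    # judged items the model ranked lower
n_rel = sum(1 for r in pool if r >= 1.0)
print(recall_at_k(ranked, n_rel), precision_at_k(ranked))  # 0.75 0.4
print(round(ndcg_at_k(ranked, pool), 3))
```

Average each metric over the query set and report it alongside the number of judged queries.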

Decision thresholds:
- Recall@10 > 55%: ship the baseline, focus on UX
- Recall@10 45-55%: train Recipe A
- Recall@10 < 45%: debug embeddings first, don't train

Ballpark benchmarks (task-dependent):
- Frozen CLIP + CSD + color weighted: ~45-55%
- Trained projection head: ~55-65%
- LoRA fine-tuned SigLIP 2: ~60-70%

Part 4: Serving

Elasticsearch HNSW (recommended for <10M vectors)

// dense_vector mapping with int8 quantisation
{
  "mappings": {
    "properties": {
      "projection_vec": {
        "type": "dense_vector",
        "dims": 256,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "int8_hnsw",
          "m": 32,
          "ef_construction": 200
        }
      }
    }
  }
}

# kNN query with num_candidates tuning
body = {
    "knn": {
        "field": "projection_vec",
        "query_vector": query_embedding.tolist(),
        "k": 10,
        "num_candidates": 200  # never below k*10
    }
}

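int8 quantisation stats:
- Memory: 4x reduction in vector storage. For 430K float32 vectors, one 768-d field is ~1.3 GB -> ~330 MB (about 2.6 GB -> 660 MB for the CLIP and CSD fields together); the 256-d projection is ~440 MB -> ~110 MB
- Recall drop: ~1.5%
- One mapping change, no code change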
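
The raw payload sizes can be checked with simple arithmetic. The sketch below counts vector bytes only (4 bytes per float32 value, 1 byte per int8 value) and ignores HNSW graph links and per-vector metadata, so real index sizes will be somewhat larger.

```python
def vector_bytes(n_vectors, dims, bytes_per_value):
    """Raw vector payload only; HNSW graph links and metadata are extra."""
    return n_vectors * dims * bytes_per_value


N = 430_000  # reference corpus size
for dims in (768, 256):
    f32 = vector_bytes(N, dims, 4)
    i8 = vector_bytes(N, dims, 1)
    print(f"{dims}-d: float32 {f32 / 1e9:.2f} GB -> int8 {i8 / 1e6:.0f} MB")
# 768-d: float32 1.32 GB -> int8 330 MB
# 256-d: float32 0.44 GB -> int8 110 MB
```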

Hybrid Retrieval via RRF

# ES Reciprocal Rank Fusion: dense visual + sparse tag signals
body = {
    "retriever": {
        "rrf": {
            "retrievers": [
                # The knn retriever needs num_candidates alongside k (kept at k*10)
                {"knn": {"field": "projection_vec", "query_vector": vec, "k": 50, "num_candidates": 500}},
                {"standard": {"query": {"terms": {"color_tags": query_tags}}}},
                {"standard": {"query": {"terms": {"mood_tags": query_moods}}}}
            ],
            "rank_constant": 60,
            "rank_window_size": 100
        }
    }
}
# Expected lift: 5-15% NDCG over dense-only for visual similarity
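
To make the two RRF parameters concrete, here is a small pure-Python sketch of the fusion rule: each retriever contributes `1 / (rank_constant + rank)` for every document in its top `rank_window_size`, with ranks starting at 1. The function name and the toy rankings are illustrative; Elasticsearch does this server-side.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, rank_constant=60, rank_window_size=100):
    """Fuse several ranked id lists; returns ids sorted by descending RRF score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking[:rank_window_size], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Break exact ties by id so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]   # visual kNN results
color = ["c", "a", "d"]   # color tag matches
mood = ["c", "d"]         # mood tag matches
fused = reciprocal_rank_fusion([dense, color, mood])
print(fused)  # ['c', 'a', 'd', 'b']
```

Document `c` wins despite ranking last in the dense list because it appears in all three lists, which is the behaviour that makes RRF useful for combining weak tag signals with the visual ranking.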

Cold Start

Content-based embedding = immediate availability for new items. No engagement warmup needed:
1. Compute CSD + CLIP + DINOv3 + color features (~100-300ms on GPU)
2. Pass through projection head (<1ms)
3. Index in ES - searchable after the next index refresh (1s by default)

Gotchas

CSD and CLIP have different operating ranges for cosine similarity. CLIP [0.1-0.5] for unrelated, [0.6-0.9] for similar. CSD ranges differ. Normalize both to z-scores before weighted sum, or your weights will be meaningless.
Backbone feature cache is critical. Don't run CLIP/CSD/DINOv3 every epoch - this wastes 90% of GPU on redundant forward passes. Precompute once to Parquet, load during training. At 50M images the difference is 500h vs 30h of training.
Projection head over-parameterization. >3 hidden layers or >2048 hidden dim overfits 7M pairs. Keep it small.
InfoNCE requires L2 normalization. Forgetting the L2 norm before loss computation is the #1 implementation bug.
ES kNN is approximate. Low num_candidates gives fast but imprecise results. Never set below k*10.
Backbone version pinning. Freeze a specific version (e.g., SigLIP-2-ViT-L/14-384-v1) and never auto-update. Upstream backbone drift silently invalidates all indexed vectors.
Test set leakage. Hold out the 200 query pins completely from training edges. Check by pin_id, not just edge membership.
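
For the first gotcha (different cosine operating ranges), a minimal sketch of z-scoring each signal before the weighted sum is shown below. The weights and scores are made-up illustrations, and the statistics are computed over one candidate list for brevity; in practice corpus-level means and standard deviations are probably the better choice.

```python
import statistics

def zscore_normalize(values):
    """Shift and scale one signal's scores to zero mean, unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant signal carries no ranking information
    return [(v - mean) / std for v in values]

def weighted_similarity(signal_scores, weights):
    """signal_scores: {signal: [score per candidate]}, weights: {signal: float}."""
    normalized = {name: zscore_normalize(vals) for name, vals in signal_scores.items()}
    n = len(next(iter(signal_scores.values())))
    return [sum(weights[name] * normalized[name][i] for name in normalized)
            for i in range(n)]


scores = {
    "clip": [0.82, 0.35, 0.64],  # wide operating range
    "csd": [0.31, 0.12, 0.28],   # narrower range, would be drowned out if summed raw
}
combined = weighted_similarity(scores, {"clip": 0.5, "csd": 0.5})
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # 0
```

After normalization a weight of 0.5 means the same thing for both signals, so tuning the weights on labeled data becomes meaningful.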