Image Similarity at Scale: Architecture Decisions from 430K to 50M¶

★★★★★ Advanced

Infrastructure and architecture decisions for image similarity systems as they grow. Each threshold triggers different bottlenecks. Make forward-compatible decisions early - the cost of retrofitting grows non-linearly.

Hard Limits by Scale¶

Elasticsearch HNSW¶

Scale	RAM (int8, 3 indices, 768-d)	p99 kNN latency	Filtered kNN	Verdict
430K	~2 GB	<50ms	OK	Comfortable
5M	~24 GB	80-150ms	Degrades	Watchable, GC pressure
10M	~48 GB	150-300ms	Slow	Upper comfort boundary
50M	~240 GB	500-1500ms	Poor	Not recommended

ES formula: num_vectors × (dims + 4) per index + 30% HNSW graph overhead. Java GC pauses blow p99 under concurrent load - Qdrant (Rust) does not have this problem.

ES stays for 430K-5M. Migrate between 5-10M.

File System Layout¶

NTFS/ext4 performance collapses past ~100K entries per directory. At 50M flat files: Explorer never opens, DIR takes hours, backup fails.

Do this NOW at 430K:

import hashlib
import os
import shutil

def get_shard_path(base_dir, pin_id, filename):
    """2-level hex sharding: base/ab/cd/pin-abcdef.jpg"""
    hash_prefix = hashlib.md5(str(pin_id).encode()).hexdigest()[:4]
    return os.path.join(base_dir, hash_prefix[:2], hash_prefix[2:], filename)

def migrate_flat_to_sharded(source_dir, target_dir):
    """Migrate 430K files in minutes. At 10M: hours. At 50M: days."""
    for filename in os.listdir(source_dir):
        pin_id = filename.split('-')[1].split('.')[0]
        dest = get_shard_path(target_dir, pin_id, filename)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(os.path.join(source_dir, filename), dest)

Result: 65,536 leaf folders, ~760 files per folder at 50M. This is manageable on every filesystem.

Database¶

SQLite: single global write lock. Fine for 430K with sequential workers. Breaks when embedding + scraping + cleanup run concurrently at 5M+.
PostgreSQL at 5M: pg_loader migration in an afternoon if schema is PostgreSQL-compatible now.
50M: PostgreSQL + Parquet snapshots for 2.5B edges (200 GB). PostgreSQL for live queries, Parquet for training.

Vector DB at 50M Scale¶

Qdrant (recommended for single-team projects)¶

Rust-based, no GC, consistent p99. Key capabilities at 50M:

# Single-machine Qdrant with int8 + on-disk storage
# 64 GB RAM box handles 400M int8 vectors (on-disk HNSW, RAM for index + quantised vecs)

# Collection config for 50M × 256-d projection head output
vectors_config:
  size: 256
  distance: Cosine
  quantization_config:
    scalar:
      type: int8
      always_ram: true  # keep quantised vecs in RAM, originals on disk
hnsw_config:
  m: 16
  ef_construct: 200
  on_disk: true  # HNSW graph on NVMe, not RAM

Self-hosted cost at 50M (Hetzner AX102, 128 GB RAM, 2x 1.92 TB NVMe): ~$130/month. Managed Qdrant Cloud for same: ~$450-1200/month.

ACORN algorithm (2025) handles tight-filter kNN much better than ES: adaptively switches between graph traversal and brute-force based on filter selectivity.

pgvectorscale (single-system option)¶

If metadata is already in PostgreSQL, consolidate:

-- pgvectorscale: DiskANN index
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

CREATE TABLE pin_embeddings (
    pin_id BIGINT PRIMARY KEY,
    projection_vec VECTOR(256)
);

CREATE INDEX ON pin_embeddings USING diskann (projection_vec);
-- Published benchmark: 471 QPS at 99% recall on 50M × 768-d Cohere vectors

One system instead of two. Trades some raw performance for operational simplicity.

When to Use What¶

Need	Choice
430K-5M, already on ES	Stay on ES
50M, single team	Qdrant (self-hosted)
50M, need PostgreSQL for metadata too	pgvectorscale
100M+ QPS at scale	Milvus
Never	Managed Pinecone/Zilliz ($3-6K/month vs $300 self-hosted)

Training Pipeline at Scale¶

Precomputed Feature Cache (critical at 5M+)¶

At 50M images, per-epoch backbone inference costs ~$600/epoch-round. The fix: extract once, train the projection head on cached features.

# Stage 1: one-time feature extraction
# Output: sharded Parquet, ~200 GB float16 for 50M images × 3 backbones

import pyarrow as pa
import pyarrow.parquet as pq
from torch.utils.data import DataLoader

def extract_features_to_parquet(
    image_ids, image_loader, clip_model, csd_model, dinov3_model,
    output_dir, shard_size=1_000_000
):
    batch_records = []
    for pin_id, img in image_loader:
        clip_vec = clip_model.encode(img)
        csd_vec = csd_model.encode(img)
        dinov3_vec = dinov3_model.encode(img)
        color_vec = extract_color_features(img)
        batch_records.append({
            'pin_id': pin_id, 'clip': clip_vec, 'csd': csd_vec,
            'dinov3': dinov3_vec, 'color': color_vec
        })
        if len(batch_records) >= shard_size:
            write_parquet_shard(batch_records, output_dir)
            batch_records = []

# Stage 2: training (runs in minutes, not hours)
# Load feature Parquet, join with edge pairs, train MLP head only

Training time with precomputed features:

Scale	Training time	GPU cost
430K	~2h H100	$5
5M	~8h H100	$20
50M	~30h H100	$80
50M (raw images, no cache)	~500h H100	$1,500

Design the feature cache now at 430K. It takes 1 day to implement. Retrofitting the training loop at 10M takes 2 weeks.

MoCo Memory Bank for 50M¶

In-batch negatives (batch 1024) are too small at 50M. MoCo-style memory bank gives 65K negatives on single GPU:

class MoCoSimilarityTrainer:
    def __init__(self, model, queue_size=65536, momentum=0.999, temperature=0.07):
        self.model = model
        self.model_k = copy.deepcopy(model)  # momentum encoder
        self.queue = F.normalize(torch.randn(queue_size, 256), dim=1)
        self.queue_ptr = 0
        self.momentum = momentum
        self.temperature = temperature

    def update_momentum_encoder(self):
        for p, pk in zip(self.model.parameters(), self.model_k.parameters()):
            pk.data = self.momentum * pk.data + (1 - self.momentum) * p.data

    def infonce_with_queue(self, q, k):
        """q, k: L2-normalised [B, D]. Queue: [Q, D]."""
        l_pos = torch.einsum('nd,nd->n', q, k).unsqueeze(-1)  # [B, 1]
        l_neg = torch.einsum('nd,qd->nq', q, self.queue.clone().detach())  # [B, Q]
        logits = torch.cat([l_pos, l_neg], dim=1) / self.temperature
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)
        self._dequeue_and_enqueue(k)
        return loss

Batch 256 + 65K queue = 65K effective negatives per step. Fits single GPU. No need for 4096+ batch size.

Storage Architecture at 50M¶

Object Storage Comparison (25 TB for 50M × 500 KB images)¶

Provider	Monthly cost	Egress	Notes
Wasabi	~$175	Free	90-day minimum retention lock
Cloudflare R2	~$375	Zero	Best for hot/CDN tier
Backblaze B2	~$375	3x free	Good backup tier
AWS S3 Standard	~$575 + egress	$0.09/GB	Never for this workload

Two-tier at 50M: 1. Wasabi: original full-res images (cold archive) 2. Cloudflare R2: thumbnails (236px + 474px) - 50M × 50 KB = 2.5 TB = $37/month

Total storage: ~$215/month for everything at 50M.

Graph Edge Storage at 2.5B Edges¶

-- PostgreSQL: live queries ("what's related to pin X?")
CREATE TABLE pin_related (
    src_id BIGINT NOT NULL,
    dst_id BIGINT NOT NULL,
    weight FLOAT DEFAULT 1.0,
    hop_level SMALLINT,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON pin_related (src_id) INCLUDE (dst_id, weight);

-- 2.5B rows ~ 200 GB. PostgreSQL handles this fine.
-- Add partitioning by src_id hash if needed.

Parquet snapshots on Wasabi/R2 for training full-dataset reads - avoids crushing live DB.

Migration Path¶

Phase 0 (NOW, at 430K) - Forward-Compatible Setup¶

File sharding: migrate to 2-level hex layout. Minutes at 430K, days at 50M.
Feature cache pipeline: write backbone outputs to Parquet per batch. Even unused now - design it right.
VectorIndex abstraction: single class wrapping ES today, Qdrant tomorrow.

class VectorIndex:
    """Abstraction over ES / Qdrant / pgvector. Change backend, not callers."""
    def upsert(self, id: str, vector: np.ndarray, metadata: dict): ...
    def knn_search(self, query_vec: np.ndarray, k: int, filters: dict) -> list: ...
    def hybrid_search(self, query_vec: np.ndarray, sparse_query: dict, k: int) -> list: ...

Image URL abstraction: get_image_url(pin_id) helper - single place to swap filesystem for S3/R2 later.
PostgreSQL-compatible SQLite schema: use strict column types, no SQLite-isms. pg_loader migration takes an afternoon with compatible schema, weeks without.

Phase 1 (1-5M, ~6 months)¶

PostgreSQL migration for metadata + edges
Wasabi cold storage for originals
Move feature cache to Parquet as primary (not EXIF)
Pilot Qdrant with 1M sample vs ES, compare latency

Phase 2 (10-50M)¶

Full Qdrant migration via alias swap
R2 hot tier for thumbnails
MoCo training loop
Distributed feature extraction (4 GPUs = 25h wall-clock vs 100h single-GPU)

Cost Summary¶

Scale	Servers	Storage	Training/month	Total/month
430K	~$20 VPS	~$0 local	~$5/run	~$20-25
5M	~$95 (PG + ES)	~$35 Wasabi	~$20/run	~$130-150
50M	~$185 (Qdrant + PG)	~$215	~$80/run	~$400-600
50M managed	~$2000-5000 vector DB	~$500	~$80/run	~$2600-5600

Self-hosted is 5-10x cheaper than managed at 50M.

Gotchas¶

HNSW deletes are soft. Deleted vectors leave ghost edges in the graph. ES requires periodic merge/optimise if you delete >10% of data. Qdrant handles this better but still needs periodic optimization.
Model versioning is critical. When you retrain and re-index, keep old and new vectors accessible for A/B testing. Schema: projection_v1, projection_v2. Rolling out without A/B loses all quality measurement.
ES free tier limits kNN features. Some int8 HNSW options require paid tier. Check license before designing around them.
SQLite single-writer lock hits earlier than expected. At 5M with 3+ concurrent workers (scraper + embedder + cleanup) you will see "database is locked" under sustained load even with WAL mode.
Thumbnail generation is one-time. Generate 236px + 474px thumbnails during ingestion and cache forever. Don't regenerate on serve - this is the biggest serving latency contributor at scale.
pgvectorscale benchmark caveat. The 471 QPS vs 41 QPS Qdrant benchmark uses pgvectorscale's sweet spot. At high concurrency and with complex filters, Qdrant's ACORN approach typically wins. Test both for your access pattern.