Synthetic Dataset Pipeline for Object Detection¶

★★★★★ Intermediate

Pipeline for building high-quality annotated datasets for YOLO + SAM fine-tuning from raw image collections. Focused on small/specialized object domains (gemstones, products, defects). Five-phase workflow: clean → analyze → annotate → review → train.

Phase 1: Dataset Cleaning¶

1.1 Deduplication (Embedding-Based)¶

import numpy as np
import torch
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def get_embeddings(image_paths, batch_size=64):
    embeddings = []
    for i in range(0, len(image_paths), batch_size):
        # Force RGB so grayscale/RGBA files don't break the processor
        batch = [Image.open(p).convert("RGB") for p in image_paths[i:i+batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        embeddings.append(feats.cpu().numpy())
    return np.vstack(embeddings)

# Find duplicates (image_paths: list of image file paths)
embeddings = get_embeddings(image_paths)
sim_matrix = cosine_similarity(embeddings)
# Upper triangle only: skips self-matches and mirrored (j, i) pairs
dup_pairs = np.argwhere(np.triu(sim_matrix, k=1) > 0.95)  # threshold configurable
# Keep highest-quality from each pair (resolution × file_size as proxy)

Threshold guidance:
- 0.98+: near-exact duplicates only
- 0.95: catches most visual duplicates (recommended)
- 0.90: aggressive dedup, may remove valid variations
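The "keep highest-quality from each pair" step is only a comment in the snippet above. A minimal sketch of one way to do it follows, assuming you have a per-image quality score such as width × height × file size. The function name `select_keepers` and the union-find grouping are illustrative choices, not part of the original pipeline.

```python
import numpy as np

def select_keepers(sim_matrix, quality, threshold=0.95):
    """Return sorted indices of images to keep after dedup.

    sim_matrix: (N, N) cosine similarities.
    quality: length-N scores, higher is better (assumed proxy:
             width * height * file_size).
    """
    n = sim_matrix.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair above the threshold (upper triangle only).
    rows, cols = np.where(np.triu(sim_matrix, k=1) > threshold)
    for i, j in zip(rows.tolist(), cols.tolist()):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Keep the best-quality member of each duplicate group.
    best = {}
    for i in range(n):
        root = find(i)
        if root not in best or quality[i] > quality[best[root]]:
            best[root] = i
    return sorted(best.values())
```

Grouping is transitive: if A matches B and B matches C, all three land in one group even when A and C are below the threshold. That is usually what you want for burst shots, but it makes a low threshold more aggressive.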

1.2 Quality Filtering¶

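from brisque import BRISQUE
import cv2
import numpy as np

brisque = BRISQUE()

def quality_filter(image_path, min_resolution=512, max_brisque=40):
    img = cv2.imread(image_path)
    if img is None:  # unreadable or corrupt file
        return False
    h, w = img.shape[:2]
    if min(h, w) < min_resolution:
        return False
    # cv2 loads BGR; convert to RGB before scoring
    score = brisque.score(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    return score < max_brisque  # lower score = better quality

# Alternative: JPEG artifact detection via DCT block analysis
def has_jpeg_artifacts(img, threshold=5.0):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Split into non-overlapping 8x8 blocks (the JPEG block size), dropping partial edge blocks
    blocks = [gray[r:r+8, c:c+8] for r in range(0, gray.shape[0], 8)
              for c in range(0, gray.shape[1], 8) if gray[r:r+8, c:c+8].shape == (8, 8)]
    dct_blocks = [cv2.dct(b) for b in blocks]
    # Heavy JPEG quantization zeroes high-frequency coefficients, so a low
    # spread in the high-frequency quadrant suggests strong compression
    block_noise = np.mean([np.std(b[4:, 4:]) for b in dct_blocks])
    return block_noise < threshold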
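
Because `max_brisque=40` is a generic default, it helps to fit the cutoff to your own domain. The sketch below assumes you have scored a small set of manually labeled images (around 50) and picks the cutoff with the best balanced accuracy. The function name and the balanced-accuracy criterion are assumptions for illustration.

```python
import numpy as np

def calibrate_max_brisque(scores, is_good):
    """Pick the cutoff that best separates manually labeled images.

    scores: BRISQUE scores (lower = better quality).
    is_good: booleans from manual review, True = acceptable image.
    Returns (threshold, balanced_accuracy) for the rule `score < threshold`.
    """
    scores = np.asarray(scores, dtype=float)
    is_good = np.asarray(is_good, dtype=bool)
    if is_good.all() or not is_good.any():
        raise ValueError("need both good and bad examples to calibrate")

    # Candidate cutoffs: midpoints between consecutive distinct scores, plus both ends.
    s = np.unique(scores)
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))

    best_t, best_acc = None, -1.0
    for t in candidates:
        accepted = scores < t
        tpr = (accepted & is_good).sum() / is_good.sum()        # good images kept
        tnr = (~accepted & ~is_good).sum() / (~is_good).sum()   # bad images rejected
        acc = (tpr + tnr) / 2.0
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc
```

The returned threshold can then be passed as `max_brisque` to `quality_filter`. With only ~50 labels the estimate is noisy, so treat it as a starting point.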

Phase 2: Color Analytics¶

For domain-specific object discovery (gemstones, products):

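from sklearn.cluster import KMeans
import cv2
import numpy as np

def analyze_object_colors(image, mask, n_clusters=8):
    """Extract dominant colors from masked region (image: RGB uint8 array)."""
    masked_pixels = image[mask > 0]  # (N, 3) array of object pixels
    lab_pixels = cv2.cvtColor(
        masked_pixels.reshape(1, -1, 3).astype(np.uint8),
        cv2.COLOR_RGB2LAB
    )[0]

    # Fixed seed so cluster assignments are reproducible across runs
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(lab_pixels)

    # Map cluster centers to prompt vocabulary
    # (cluster_to_prompt_map is a project-specific helper, not defined here)
    return cluster_to_prompt_map(kmeans.cluster_centers_)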

Output: color_prompt_map.json, which maps color clusters to the detection prompts that work best for SAM/Florence-2.
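The source does not show `cluster_to_prompt_map`. One possible shape for it is a nearest-reference-color lookup in LAB space, sketched below. The reference colors, the prompt vocabulary, and the output schema are all hypothetical. The LAB triples are approximate OpenCV 8-bit values (L scaled to 0-255, a/b offset by 128) for pure sRGB primaries, and should be replaced with values derived from your own swatches.

```python
import json
import numpy as np

# Approximate OpenCV 8-bit LAB values; illustrative placeholders only.
REFERENCE_LAB = {
    "red": (136, 208, 195),
    "green": (224, 42, 211),
    "blue": (82, 207, 20),
    "white": (255, 128, 128),
}

# Hypothetical prompt vocabulary keyed by color name.
COLOR_PROMPTS = {
    "red": ["ruby", "red gemstone"],
    "green": ["emerald", "green gemstone"],
    "blue": ["sapphire", "blue gemstone"],
    "white": ["diamond", "clear gemstone"],
}

def cluster_to_prompt_map(cluster_centers, reference_lab=REFERENCE_LAB,
                          prompts=COLOR_PROMPTS):
    """Assign each LAB cluster center to its nearest reference color."""
    names = list(reference_lab)
    refs = np.array([reference_lab[n] for n in names], dtype=float)
    centers = np.asarray(cluster_centers, dtype=float)

    # Euclidean distance between every center and every reference color.
    dists = np.linalg.norm(centers[:, None, :] - refs[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)

    result = {}
    for center, idx in zip(centers, nearest):
        name = names[idx]
        entry = result.setdefault(
            name, {"prompts": list(prompts.get(name, [])), "centers": []}
        )
        entry["centers"].append([round(float(v), 1) for v in center])
    return result

def save_color_prompt_map(mapping, path="color_prompt_map.json"):
    """Write the mapping in a JSON layout (assumed, not from the source)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(mapping, fh, indent=2)
```

Euclidean distance on 8-bit LAB is a rough stand-in for perceptual distance. It is usually adequate for coarse color naming, but a finer vocabulary may need more reference points per color.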

Phase 3: Auto-Annotation¶

Method A: SAM3 with Prompt Conditioning (Primary)¶

# SAM3 Pipeline Conditioning (per-image)
def annotate_with_sam3(image_path, prompts, pipeline):
    """
    prompts: list of text prompts from color_prompt_map.json
    pipeline: ComfyUI SAM3 pipeline
    """
    results = []
    for prompt in prompts:
        boxes, masks, scores = pipeline.run(
            image=image_path,
            prompt=prompt,
            output_format="json"
        )
        results.extend(zip(boxes, masks, scores))

    # Keep predictions above confidence threshold
    return [(box, mask, score) for box, mask, score in results if score > 0.85]

Store: JSON per image with boxes + masks (PNG) + scores + prompts + dominant colors.
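
A minimal sketch of the per-image record described above. The field names, the mask file naming scheme, and the `save_mask` callback are assumptions. In practice `save_mask` could wrap `cv2.imwrite` to write each mask as a PNG. Note that `annotate_with_sam3` as written does not return the prompt with each detection, so the caller has to carry it along.

```python
import json
from pathlib import Path

def write_annotation_record(image_path, detections, dominant_colors, out_dir,
                            save_mask=None):
    """Write one JSON record per image and return its path.

    detections: list of dicts with keys "box", "mask", "score", "prompt".
    dominant_colors: JSON-serializable color summary for the image.
    save_mask: optional callable (path, mask) that writes the mask PNG.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(image_path).stem

    objects = []
    for i, det in enumerate(detections):
        mask_name = f"{stem}_mask_{i:03d}.png"
        if save_mask is not None:
            save_mask(out_dir / mask_name, det["mask"])
        objects.append({
            "box": [float(v) for v in det["box"]],
            "mask_file": mask_name,       # masks live next to the JSON file
            "score": float(det["score"]),
            "prompt": det["prompt"],
        })

    record = {
        "image": str(image_path),
        "objects": objects,
        "dominant_colors": dominant_colors,
    }
    json_path = out_dir / f"{stem}.json"
    json_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return json_path
```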

Method B: Florence-2 + SAM 2.1 (Cross-Validation)¶

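from transformers import AutoProcessor, AutoModelForCausalLM
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Florence-2: open-vocabulary object detection (the model repo ships custom code)
florence = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
# florence_detect is a user-defined wrapper around the processor + generate() calls.
# "<OD>" takes no text input; text-conditioned detection uses
# "<OPEN_VOCABULARY_DETECTION>" (or "<CAPTION_TO_PHRASE_GROUNDING>").
boxes = florence_detect(image, task="<OPEN_VOCABULARY_DETECTION>", prompt="gemstone")

# SAM2.1: boxes → pixel-precise masks
# build_sam2 takes a model config and a checkpoint path (adjust both to your install)
sam2 = SAM2ImagePredictor(build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "sam2.1_hiera_large.pt"))
sam2.set_image(image)  # must be called before predict()
masks, scores, _ = sam2.predict(box=boxes, multimask_output=False)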

Use for cross-validation against SAM3 annotations. Flag disagreements for human review.
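
One way to flag disagreements is to match boxes from the two methods by IoU and send any image with unmatched boxes to review. The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel coordinates. The 0.5 IoU threshold and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def cross_validate(boxes_a, boxes_b, iou_threshold=0.5):
    """Greedy one-to-one matching, highest IoU first.

    Returns (matches, only_a, only_b), where matches holds (i, j, iou)
    and only_a / only_b hold indices of boxes with no counterpart.
    """
    candidates = sorted(
        ((box_iou(a, b), i, j)
         for i, a in enumerate(boxes_a)
         for j, b in enumerate(boxes_b)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for iou, i, j in candidates:
        if iou < iou_threshold:
            break  # remaining candidates are all weaker
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matches.append((i, j, iou))
    only_a = [i for i in range(len(boxes_a)) if i not in used_a]
    only_b = [j for j in range(len(boxes_b)) if j not in used_b]
    return matches, only_a, only_b

def needs_review(boxes_a, boxes_b, iou_threshold=0.5):
    """True when either method found a box the other did not."""
    _, only_a, only_b = cross_validate(boxes_a, boxes_b, iou_threshold)
    return bool(only_a or only_b)
```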

Method C: Caption-Based Vocabulary Expansion¶

# Generate descriptions → extract new prompts → re-annotate
captions = florence_caption(image, task="<CAPTION>")
new_prompts = parse_object_vocabulary(captions)  # NLP extraction
# A/B test new vs existing prompts on held-out images
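
The source leaves `parse_object_vocabulary` undefined. As a deliberately naive stand-in, the sketch below counts stopword-free word n-grams across captions and returns the frequent ones that are not already in the prompt list. The stopword list, thresholds, and function name are assumptions. A real pipeline would more likely use a POS tagger or noun-phrase chunker.

```python
import re
from collections import Counter

# Small illustrative stopword list; extend for your captions.
STOPWORDS = {"a", "an", "the", "on", "of", "in", "with", "and",
             "is", "are", "at", "to", "by"}

def extract_candidate_prompts(captions, existing_prompts=(), min_count=2,
                              max_words=3):
    """Return new candidate prompts, most frequent first.

    Each phrase is counted at most once per caption, so min_count is the
    number of distinct captions that mention it.
    """
    existing = {p.lower() for p in existing_prompts}
    counts = Counter()
    for caption in captions:
        words = re.findall(r"[a-z]+", caption.lower())
        phrases = set()
        run = []
        for word in words + [""]:  # the empty string flushes the final run
            if word and word not in STOPWORDS:
                run.append(word)
                continue
            # Emit every n-gram (1..max_words) inside the run of content words.
            for n in range(1, max_words + 1):
                for k in range(len(run) - n + 1):
                    phrases.add(" ".join(run[k:k + n]))
            run = []
        counts.update(phrases)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, c in ranked if c >= min_count and p not in existing]
```

The candidates still need the A/B check on held-out images before they are added to the prompt map, since frequent phrases are not necessarily good detection prompts.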

Phase 4: Review Tool¶

Minimal HTML/JS annotation reviewer (no server required):

// Key features for production review:
// 1. Arrow key navigation (← →)
// 2. Canvas overlay showing all bboxes
// 3. Drag to resize/reposition bbox
// 4. Delete key: remove false positives
// 5. Click canvas: add missed annotations
// 6. Approve/Reject per image (keyboard shortcuts)
// 7. Export to YOLO .txt format

// YOLO export format:
// <class_id> <cx> <cy> <w> <h>  (normalized 0-1)
function exportToYOLO(annotation) {
    const {bbox, class_id, img_w, img_h} = annotation;
    const cx = (bbox.x + bbox.w/2) / img_w;
    const cy = (bbox.y + bbox.h/2) / img_h;
    return `${class_id} ${cx} ${cy} ${bbox.w/img_w} ${bbox.h/img_h}`;
}
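
For checking exported labels on the training side, the same conversion and its inverse can be sketched in Python. This assumes `bbox.x` and `bbox.y` are the top-left corner in pixels, as the JavaScript above implies. The function names and the six-decimal formatting are illustrative choices.

```python
def bbox_to_yolo(class_id, x, y, w, h, img_w, img_h):
    """Pixel box (top-left x, y, width, height) -> YOLO label line."""
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"

def yolo_to_bbox(line, img_w, img_h):
    """YOLO label line -> (class_id, x, y, w, h) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
    if not all(0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
        raise ValueError(f"coordinates must be normalized to 0-1: {line!r}")
    bw, bh = w * img_w, h * img_h
    return int(class_id), cx * img_w - bw / 2, cy * img_h - bh / 2, bw, bh
```

A round trip through both functions should reproduce the original box to within rounding error, which makes a cheap sanity check on a reviewer export.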

Phase 5: YOLO Training¶

# data.yaml
path: /data/dataset
train: images/train
val: images/val
nc: 3  # number of classes
names: ['diamond', 'ruby', 'emerald']

# Training
yolo train \
  model=yolov8m.pt \
  data=data.yaml \
  epochs=100 \
  imgsz=640 \
  batch=16 \
  device=0

# Evaluation
yolo val model=runs/detect/train/weights/best.pt data=data.yaml
# Report: mAP@0.5, mAP@0.5:0.95, precision, recall
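
The `data.yaml` above expects `images/train` and `images/val`, with label files in parallel `labels/train` and `labels/val` folders. A minimal sketch for producing that layout from a flat folder of reviewed images and labels follows. The function name, the 80/20 split, and the fixed seed are assumptions.

```python
import random
import shutil
from pathlib import Path

def split_dataset(image_dir, label_dir, out_root, val_fraction=0.2, seed=0):
    """Copy image/label pairs into images/{train,val} and labels/{train,val}.

    Returns the number of images per split.
    """
    image_dir, label_dir, out_root = Path(image_dir), Path(label_dir), Path(out_root)
    exts = {".jpg", ".jpeg", ".png"}
    images = sorted(p for p in image_dir.iterdir() if p.suffix.lower() in exts)

    # Seeded shuffle so the split is reproducible.
    random.Random(seed).shuffle(images)
    n_val = int(round(len(images) * val_fraction))
    splits = {"val": images[:n_val], "train": images[n_val:]}

    for split, files in splits.items():
        (out_root / "images" / split).mkdir(parents=True, exist_ok=True)
        (out_root / "labels" / split).mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy2(img, out_root / "images" / split / img.name)
            label = label_dir / f"{img.stem}.txt"
            if label.exists():  # images without a label file act as background
                shutil.copy2(label, out_root / "labels" / split / label.name)
    return {name: len(files) for name, files in splits.items()}
```

A random split can put near-duplicates of the same object on both sides, so run it only after the Phase 1 deduplication.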

Hardware allocation:
- Dedup + quality filter: CPU or consumer GPU (CLIP inference, ~RTX 4090)
- SAM3 annotation: H100/H200 for large batches
- YOLO training: RTX 4090 adequate for most configs

Dataset Quality Metrics¶

Metric	Target	How to Check
Dedup rate	<5% duplicates remain	CLIP sim histogram
Annotation coverage	>90% true objects annotated	Sample 100, manual count
False positive rate	<10%	Review tool statistics
Inter-annotator agreement	>0.85 IoU on 50 shared images	If multiple annotators
mAP@0.5 on val	>0.7 for specialized domain	YOLO eval output
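
The false-positive and coverage rows can be estimated from review-tool statistics. The sketch below assumes the reviewer logs, per image, how many auto-annotations it started with, how many were deleted as false positives, and how many missed objects were added by hand. These field names are hypothetical.

```python
def review_metrics(review_log):
    """Summarize reviewer edits into the table's quality metrics.

    review_log: list of dicts with integer keys "auto", "deleted", "added".
    """
    auto = sum(r["auto"] for r in review_log)
    deleted = sum(r["deleted"] for r in review_log)
    added = sum(r["added"] for r in review_log)

    true_positives = auto - deleted          # auto boxes the reviewer kept
    total_objects = true_positives + added   # everything present after review

    fp_rate = deleted / auto if auto else 0.0
    coverage = true_positives / total_objects if total_objects else 0.0
    return {
        "false_positive_rate": fp_rate,
        "coverage": coverage,
        # Targets from the table: <10% false positives, >90% coverage.
        "meets_targets": fp_rate < 0.10 and coverage > 0.90,
    }
```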

Gotchas¶

CLIP dedup misses semantic duplicates with different backgrounds: two photos of the same ring, one on white velvet and one on marble, score a cosine similarity of ~0.75 and are not caught. Supplement with perceptual hashing for exact duplicates.
SAM3 annotation quality depends heavily on prompt vocabulary: generic prompts ("object", "item") give poor masks. Domain-specific prompts ("round brilliant cut diamond", "oval ruby") improve masks dramatically.
Florence-2 open-set detection hallucinates with very small objects: objects <50×50px at 640px resolution get missed or produce garbage boxes. Use SAM3 or manual annotation for micro-details.
BRISQUE scores vary significantly by domain: the default thresholds, calibrated on natural images, are wrong for product photography (often scored as "low quality" due to clean backgrounds). Calibrate the threshold on 50 manually quality-labeled images from your domain.
YOLO imgsz=640 vs 1024: YOLOv8 default is 640. For small objects (gemstones, jewelry detail), training at 1024 improves mAP@0.5:0.95 by 5-15% at 2× memory cost.
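
For the perceptual-hashing supplement mentioned in the first gotcha, a minimal average-hash sketch in NumPy is shown below. It assumes the image is already loaded as a 2-D grayscale array. The block-averaging downsample and the function names are illustrative, and a dedicated hashing library would be a reasonable alternative.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Average hash of a 2-D grayscale array as a flat boolean vector."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    if h < hash_size or w < hash_size:
        raise ValueError("image is smaller than the hash grid")
    # Crop so the image divides evenly, then block-average down to the grid.
    gray = gray[: h - h % hash_size, : w - w % hash_size]
    bh, bw = gray.shape[0] // hash_size, gray.shape[1] // hash_size
    small = gray.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # One bit per cell: brighter than the overall mean or not.
    return (small > small.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(hash_a != hash_b))
```

A distance of 0 or a few bits indicates an exact or near-exact copy (re-encoded, brightness-shifted). This check is cheap, but it does not help with the same object on a different background.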