
TIPSv2: Text-Image Pretraining with Spatial Awareness

Intermediate

Google DeepMind model for dense spatial feature prediction (depth, surface normals, segmentation) producing per-patch embeddings that remain compatible with text embeddings.

Architecture

  • Full name: Text-Image Pretraining with Spatial Awareness v2
  • Variant covered: B/14 — 86M vision params + 110M text params
  • Key departure from CLIP: CLIP produces one global vector per image; TIPSv2 produces a per-patch feature map (dense), enabling DPT (Dense Prediction Transformer) decoder heads
  • Backbone: ViT-B/14 patch size, trained with spatial awareness objectives alongside contrastive text-image alignment
  • Output: one embedding per 14×14 pixel patch (14-pixel stride) → direct input to DPT heads for pixel-level tasks
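To make the CLIP-vs-TIPSv2 output contrast concrete, here is a shape-only sketch; the tensors are random stand-ins for real model outputs, and the variable names are illustrative, not from any released API:

```python
import torch

batch, embed_dim = 1, 768            # ViT-B width
h, w, stride = 224, 224, 14

# CLIP-style: one global vector per image
clip_style = torch.randn(batch, embed_dim)

# TIPSv2-style: one embedding per 14x14 patch
grid = (h // stride, w // stride)    # 16 x 16 patches
tipsv2_style = torch.randn(batch, grid[0] * grid[1], embed_dim)  # 256 patch tokens

print(clip_style.shape, tipsv2_style.shape)
```

The dense (batch, patches, dim) layout is what the DPT heads below consume; pooling it back to a single vector recovers the CLIP-style global embedding at the cost of spatial detail.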

Supported Downstream Tasks

image → TIPSv2 backbone → patch features
                    ┌─────────┼──────────┐
                    ▼         ▼          ▼
              depth map   normals    segmentation mask

All three heads share the same frozen or fine-tuned backbone. Cross-task transfer is possible because the per-patch representation is not task-specific.
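A minimal sketch of the shared-backbone pattern described above, assuming nothing about the real head architecture: a 1×1 conv stands in for each task head (real use would attach DPT decoders), and random features stand in for TIPSv2 output:

```python
import torch
from torch import nn

class LinearHead(nn.Module):
    """Stand-in task head: 1x1 conv over the patch-feature map."""
    def __init__(self, embed_dim, out_channels):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, out_channels, kernel_size=1)

    def forward(self, feats):          # feats: (B, C, H/14, W/14)
        return self.proj(feats)

embed_dim = 768
feats = torch.randn(1, embed_dim, 16, 16)   # stand-in for frozen-backbone features

# One backbone forward pass feeds all three task heads
heads = {
    "depth": LinearHead(embed_dim, 1),
    "normals": LinearHead(embed_dim, 3),
    "segmentation": LinearHead(embed_dim, 21),
}
outputs = {task: head(feats) for task, head in heads.items()}
```

Because every head reads the same per-patch features, adding a new dense task costs only one extra decoder, not another backbone.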

Additional use cases

  • Backbone for downstream vision tasks — patch embeddings can be treated as a general dense visual feature extractor, similar to DINOv2 but with text alignment preserved
  • Multi-modal retrieval with spatial grounding — text embedding space compatibility means region-level text-image matching is possible without extra alignment training
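The region-level matching idea can be sketched as a per-patch similarity heatmap. The embeddings here are random placeholders; the assumption (from the text above) is only that patch and text embeddings share a space:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings in a shared 768-dim space
patch_feats = torch.randn(1, 256, 768)   # (batch, patches, dim), 16x16 grid
text_emb = torch.randn(1, 768)           # one text query embedding

# Cosine similarity of the query against every patch token
sim = F.cosine_similarity(patch_feats, text_emb.unsqueeze(1), dim=-1)  # (1, 256)
heatmap = sim.reshape(1, 16, 16)         # spatial grounding map over the patch grid
```

High-similarity cells in `heatmap` indicate image regions matching the text query, which is exactly what a CLIP-style global embedding cannot localize.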

Availability Status (as of April 2026)

Resource                                   Status
google/tipsv2-b14 on HuggingFace           404 — unavailable (published, then pulled)
HF Demo Space google/tipsv2-gpu-explorer   Working (Zero GPU, interactive)
ModelScope / Gitee mirrors                 Not found
arXiv paper                                Not found under TIPSv2; presumed CVPR 2026 submission
TIPS v1 weights                            Available — google-deepmind/tips on GitHub
TIPS v1 paper                              arXiv [2410.16512]

Likely reason for v2 unavailability: weights published ahead of CVPR 2026 camera-ready deadline, then retracted.

TIPS v1 as Fallback

The v1 paper [2410.16512] covers the same spatial-awareness pretraining concept. Use it if the v2 weights remain unavailable.

# Clone v1 code and weights
git clone https://github.com/google-deepmind/tips
cd tips
pip install -r requirements.txt

Core inference pattern from the v1 codebase:

import tips
import torch
from PIL import Image

model = tips.load_model("tips_b14")          # downloads ~350 MB checkpoint
model.eval()

img = Image.open("photo.jpg").convert("RGB")
inputs = tips.preprocess(img)                # returns dict with pixel_values tensor

with torch.no_grad():
    outputs = model(**inputs)

# patch_features: (1, num_patches, embed_dim)
patch_features = outputs.last_hidden_state

Pixel-level tasks such as depth estimation require an additional fine-tuned DPT decoder; the v1 repo ships pre-trained heads for depth and normals.

Model               Output type        Text alignment   Depth      Normals    Segmentation
CLIP ViT-B/16       global embedding   yes              no         no         no
DINOv2 ViT-B/14     patch + global     no               via head   via head   via head
TIPSv2 B/14         patch (dense)      yes              yes        yes        yes
Depth Anything v2   depth map          no               yes        no         no

TIPSv2's differentiator is the combination of dense spatial features and text embedding compatibility. DINOv2 gives better dense features in isolation but has no text alignment.

DPT Head Overview

Dense Prediction Transformer (DPT) heads take patch tokens from a ViT backbone and upsample them to full resolution through a multi-scale fusion scheme:

patch tokens (e.g., 37×37 for a 518×518 input at /14 stride)
  ↓ Reassemble × 4 scales   ← extracts tokens at different depths of the ViT
  ↓ Fusion blocks (Conv + BN + ReLU)
  ↓ Final head (Conv → output channels)
  Dense output (depth / normals / segmentation logits)

Any ViT producing patch-level tokens can drive a DPT head; TIPSv2 improves head quality by providing spatially-aware features rather than purely semantic ones.
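The tokens-to-dense-map idea can be sketched in a toy head. This is not the real DPT architecture (which reassembles tokens from four ViT depths and fuses them progressively); it collapses the pipeline to one level to show the reassemble → upsample → conv → output flow:

```python
import torch
from torch import nn

class MiniDPTHead(nn.Module):
    """Toy single-scale sketch of the DPT idea: tokens -> 2D map -> dense output."""
    def __init__(self, embed_dim=768, hidden=128, out_channels=1):
        super().__init__()
        self.fuse = nn.Sequential(                   # stand-in fusion block
            nn.Conv2d(embed_dim, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(hidden, out_channels, kernel_size=1)

    def forward(self, tokens, grid):                 # tokens: (B, N, C)
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, *grid)   # reassemble tokens to 2D
        x = nn.functional.interpolate(                    # upsample /14 grid to pixels
            x, scale_factor=14, mode="bilinear", align_corners=False)
        return self.head(self.fuse(x))

tokens = torch.randn(1, 256, 768)            # 16x16 patch grid from a 224x224 image
depth = MiniDPTHead()(tokens, (16, 16))      # dense map at full 224x224 resolution
```

A real DPT head upsamples in stages with fusion at each scale rather than one bilinear jump, but the input/output contract (patch tokens in, pixel-resolution map out) is the same.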

Gotchas

  • HF weights 404: google/tipsv2-b14 will raise RepositoryNotFoundError as of April 2026. Fall back to v1 weights (google-deepmind/tips) or use the live demo Space for experimentation only.
  • Global vs. patch pooling: if you pool TIPSv2 patch features to a single vector for image retrieval, you lose spatial information. Use full patch sequences as input to DPT heads, not pooled vectors.
  • Patch stride vs. input size: at /14 stride, a 224×224 image gives 16×16 = 256 tokens; a 518×518 image gives ~37×37 = 1369 tokens. Memory scales quadratically — batch large images carefully.
  • Text compatibility does not mean zero-shot depth: the text-image alignment is for retrieval/grounding, not for prompting depth output. Depth estimation still requires a trained DPT head.
  • v1 vs. v2 quality gap unknown: no published benchmark comparing v1 and v2 is available while v2 paper is under review. Treat v1 as a reasonable approximation, not a confirmed equivalent.
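The token-count gotcha above is easy to check numerically. A small sketch of how token count, and with it the per-head self-attention score matrix, grows with input size (fp32 assumed; real memory also depends on batch size, heads, and attention implementation):

```python
def token_count(side, stride=14):
    """Patch tokens for a square input at the given stride."""
    return (side // stride) ** 2

def attn_matrix_mb(n_tokens, bytes_per_elem=4):
    """Rough fp32 size of one full self-attention score matrix (tokens x tokens)."""
    return n_tokens * n_tokens * bytes_per_elem / 2**20

for side in (224, 518, 1036):
    n = token_count(side)
    print(f"{side}px -> {n} tokens, score matrix ~ {attn_matrix_mb(n):.1f} MB/head")
```

Doubling the input side roughly quadruples the token count and sixteenfolds the attention matrix, which is why large images need small batches.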

Monitoring for v2 Release

# Poll HF API for model availability
python - <<'EOF'
import requests
r = requests.get("https://huggingface.co/api/models/google/tipsv2-b14")
print(r.status_code, r.json().get("error", "available"))
EOF

When the model returns 200, download via:

from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained("google/tipsv2-b14")
processor = AutoProcessor.from_pretrained("google/tipsv2-b14")

See Also