Embeddings¶
Embeddings convert text into high-dimensional numeric vectors where proximity represents semantic similarity. They are the foundation of vector search, RAG systems, and semantic understanding in LLM applications.
Key Facts¶
- Embedding models produce fixed-size vectors (e.g., 3072 dimensions for OpenAI text-embedding-3-large)
- Points close in vector space are semantically similar
- Embeddings capture meaning, not exact words - "car" and "automobile" land close together, while "bank" in a financial context and "bank" in a river context land far apart (given enough surrounding context to disambiguate)
- Modern models encode text along thousands of abstract features - individual dimensions have no human-interpretable meaning
Similarity Metrics¶
Cosine similarity is the standard metric - measures the angle between vectors:
```python
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Range: -1 to 1 (1 = identical direction)
```
Other metrics: Euclidean distance (L2), dot product (when vectors are normalized, equals cosine similarity).
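A quick sketch (using NumPy and made-up vectors) of why the dot product of unit-normalized vectors equals their cosine similarity:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity computed directly from raw vectors
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length first, then take a plain dot product
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
dot_normalized = np.dot(a_hat, b_hat)

assert abs(cos - dot_normalized) < 1e-12
```

This is why vector databases often store pre-normalized vectors: the cheaper dot product then gives the same ranking as cosine similarity.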
Embedding Models¶
| Model | Dimensions | Provider | Notes |
|---|---|---|---|
| text-embedding-3-large | 3072 (default) | OpenAI | Reducible via dimensions param (e.g. to 1536) |
| text-embedding-3-small | 1536 (default) | OpenAI | Cheaper, lower quality; reducible (e.g. to 512) |
| BGE-large | 1024 | BAAI | Open-source, strong performance |
| E5-large | 1024 | Microsoft | Good for retrieval tasks |
| Cohere embed-v3 | 1024 | Cohere | Multilingual, search-optimized |
| Ollama embeddings | Varies | Local | Use same Ollama server, free, private |
Patterns¶
Generate Embeddings (OpenAI)¶
```python
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="What is machine learning?",
)
vector = response.data[0].embedding  # list of floats
```
Generate Embeddings (LangChain)¶
```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector = embeddings.embed_query("What is machine learning?")
doc_vectors = embeddings.embed_documents(["doc1", "doc2", "doc3"])
```
Test-Time Reranking¶
After initial embedding retrieval, use a cross-encoder to rerank by fine-grained relevance:
```python
from sentence_transformers import CrossEncoder

cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, candidate) for candidate in candidates]
scores = cross.predict(pairs)  # higher = more relevant
best_idx = int(scores.argmax())
```
Self-Consistency (Best-of-N)¶
Sample N answers, pick most frequent - embeddings can measure answer similarity for clustering:
```python
import collections

def majority_vote(candidates):
    # most_common(1) returns [(answer, count)]; return just the answer
    return collections.Counter(candidates).most_common(1)[0][0]
```
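Exact-string counting undercounts when candidate answers differ in wording but not meaning. A sketch of the embedding-based variant mentioned above: greedily cluster candidates whose cosine similarity exceeds a threshold, then return a member of the largest cluster. The toy 2-D vectors and the 0.9 threshold below are illustrative stand-ins for real embeddings.

```python
import numpy as np

def cluster_vote(candidates, vectors, threshold=0.9):
    """Greedy clustering by cosine similarity; returns a member of the largest cluster."""
    vectors = [np.asarray(v) / np.linalg.norm(v) for v in vectors]
    clusters = []  # each cluster is a list of candidate indices
    for i, v in enumerate(vectors):
        for cluster in clusters:
            # Compare against the cluster's first member (its "centroid" proxy)
            if np.dot(v, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    biggest = max(clusters, key=len)
    return candidates[biggest[0]]

# Toy example: the first two "answers" embed almost identically
answers = ["4", "four", "5"]
vecs = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
winner = cluster_vote(answers, vecs)  # "4" and "four" form the largest cluster
```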
Known Issues¶
- Non-determinism: OpenAI embeddings return slightly different vectors across API calls for the same text. The differences are tiny in absolute terms, but they break deterministic unit tests.
- Cosine similarity misses: embedding search can fail to retrieve text that is obviously present - an exact string/keyword match succeeds while the embedding score falls below a typical 0.6 threshold.
- Semantic vs lexical confusion: "risk of liquidity" may match "liquidity amount" even though they mean different things.
- Embedding model must match: query and documents must use the same embedding model. Mixing models produces meaningless similarity scores.
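A common mitigation for the cosine-miss failure mode is hybrid retrieval: check for exact keyword hits first, and fall back to embedding similarity only when none exist. A minimal sketch (the substring check, `hybrid_search` name, and 0.6 threshold are illustrative assumptions, not a library API):

```python
import numpy as np

def hybrid_search(query, docs, doc_vectors, query_vector, threshold=0.6):
    """Exact keyword hits win; otherwise fall back to cosine similarity."""
    keyword_hits = [d for d in docs if query.lower() in d.lower()]
    if keyword_hits:
        return keyword_hits

    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [(cos(query_vector, v), d) for d, v in zip(docs, doc_vectors)]
    return [d for score, d in sorted(scored, reverse=True) if score >= threshold]
```

Production systems usually go further (e.g. BM25 plus reciprocal-rank fusion), but the fallback ordering is the key idea.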
Gotchas¶
- Always use the same embedding model for indexing and querying - mixing models gives garbage results
- Embedding quality degrades for very short texts (1-2 words) and very long texts (beyond model's max input)
- Multilingual embeddings exist but cross-lingual similarity is weaker than same-language
- Embedding API calls add latency and cost to every query - consider caching for repeated queries
- Dimension reduction (e.g., text-embedding-3-large with fewer dimensions) trades quality for speed/cost
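Per the caching point above, a minimal in-process cache sketch. The `embed_fn` callable and `EmbeddingCache` class are placeholders for whatever client and wrapper you use; keys include the model name so mixed-model lookups can never collide.

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by (model, text) so repeated queries skip the API."""

    def __init__(self, embed_fn, model):
        self._embed_fn = embed_fn
        self._model = model
        self._store = {}

    def embed(self, text):
        key = hashlib.sha256(f"{self._model}\x00{text}".encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
        return self._store[key]

# Usage with a stand-in embedding function
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 0.0]

cache = EmbeddingCache(fake_embed, model="text-embedding-3-large")
v1 = cache.embed("hello")
v2 = cache.embed("hello")  # served from cache; fake_embed runs only once
```

For persistence across processes, the same keying scheme works with Redis or a database table in place of the dict.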
See Also¶
- vector databases - Storage and search for embedding vectors
- rag pipeline - How embeddings power retrieval-augmented generation
- tokenization - Text to tokens before embedding
- chunking strategies - Optimal text sizes for embedding