Memory Retrieval Patterns

Intermediate

How agents find relevant information in their memory. The retrieval method determines recall quality, latency, and cost. Different approaches work at different scales.

Key Facts

  • Index-based navigation works up to ~500 articles with zero retrieval cost
  • BM25 (keyword) catches exact matches that semantic search misses
  • Vector similarity catches semantic matches that keyword search misses
  • Hybrid (BM25 + vector) with reranking is the most reliable approach for production
  • Lightweight LLM reranking (~$0.001/query) can push recall from 96.6% to ~100%
  • Retrieval latency compounds in agent loops - pre-fetch at plan start, not on every step

Retrieval Methods Comparison

| Method | Scale | Cost/Query | Recall | Latency | Best For |
|---|---|---|---|---|---|
| Index navigation | <500 articles | ~$0 | Depends on index quality | <100ms | Small knowledge bases |
| Keyword/BM25 | Any | ~$0 | High for exact matches | <10ms | Known terminology |
| Vector similarity | Any | Embedding cost | High for semantic matches | 10-100ms | Fuzzy/conceptual queries |
| Hybrid (BM25+vector) | Any | Embedding cost | Highest | 20-150ms | Production systems |
| LLM rerank | Any | ~$0.001 | Near-perfect | +200-500ms | Critical accuracy needs |

Index-Based Navigation

The simplest retrieval: agent reads an index, picks relevant files, reads them.

# Memory Index (agent reads this first)

## User Preferences
- [db-preference](mem/db-preference.md) - PostgreSQL, pgvector, reasons
- [editor](mem/editor.md) - Neovim, plugins, keybindings

## Current Projects
- [api-rewrite](mem/api-rewrite.md) - GraphQL migration, Q2 deadline

How it works: The index fits in context (~500 tokens for 100 entries). The agent pattern-matches the user's query against index entries and reads the relevant files. No embeddings, no search infrastructure.

Scaling: At ~500 entries, the index itself exceeds comfortable context size. Split into domain indexes or add search.
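The whole pattern fits in a few lines of plain Python. A minimal sketch, assuming an index in the markdown format shown above; the token-overlap scoring rule and `select_files` helper are illustrative, not a fixed API (a real agent would pattern-match in-context rather than with code):

```python
INDEX = """\
## User Preferences
- [db-preference](mem/db-preference.md) - PostgreSQL, pgvector, reasons
- [editor](mem/editor.md) - Neovim, plugins, keybindings

## Current Projects
- [api-rewrite](mem/api-rewrite.md) - GraphQL migration, Q2 deadline
"""

def parse_index(index_text: str) -> list[tuple[str, str]]:
    """Extract (path, entry-line) pairs from '- [name](path) - desc' lines."""
    entries = []
    for line in index_text.splitlines():
        line = line.strip()
        if line.startswith("- [") and "](" in line:
            path = line.split("](")[1].split(")")[0]
            entries.append((path, line))
    return entries

def select_files(query: str, entries: list[tuple[str, str]], top_k: int = 2) -> list[str]:
    """Crude relevance: count query tokens appearing in the entry line."""
    tokens = set(query.lower().split())
    scored = sorted(entries, key=lambda e: -sum(t in e[1].lower() for t in tokens))
    return [path for path, _ in scored[:top_k]]

print(select_files("which database does the user prefer", parse_index(INDEX)))
```

The agent would then read the selected files into context and answer from them.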

Keyword Search (BM25)

Term frequency-based retrieval. Finds documents containing the query's exact words.

from rank_bm25 import BM25Okapi

class KeywordMemory:
    def __init__(self, documents: list[str]):
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents

    def search(self, query: str, top_k: int = 5) -> list[str]:
        tokens = query.lower().split()
        scores = self.bm25.get_scores(tokens)
        top_indices = scores.argsort()[-top_k:][::-1]
        return [self.documents[i] for i in top_indices if scores[i] > 0]

Strengths: Fast, free, deterministic, catches exact terminology. Weaknesses: Misses semantically similar but lexically different content ("car" won't find "automobile").

Vector Similarity Search

Embed the query and documents, then find the closest vectors. See embeddings and vector databases for details.

import chromadb

client = chromadb.PersistentClient(path="./memory_db")
collection = client.get_collection("memory")

results = collection.query(
    query_texts=["What database does the user prefer?"],
    n_results=5,
    where={"valid_to": None}  # only current facts
)

Strengths: Semantic understanding, handles paraphrasing, works across languages. Weaknesses: Misses exact keyword matches sometimes, non-deterministic, requires embedding infrastructure.
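Under the hood, "closest vectors" usually means highest cosine similarity between embeddings. A stdlib-only sketch with made-up 3-dimensional vectors (real embedding models emit hundreds of dimensions; the numbers here are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings": lexically different but semantically close texts
# land near each other in the vector space.
query_vec = [0.9, 0.1, 0.0]   # "car"
doc_car   = [0.8, 0.2, 0.1]   # "automobile"
doc_food  = [0.0, 0.1, 0.9]   # "recipe for soup"

print(cosine_similarity(query_vec, doc_car) > cosine_similarity(query_vec, doc_food))  # → True
```

This is exactly the case BM25 loses ("car" never matches "automobile") and vector search wins.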

Hybrid Search

Combine BM25 and vector search. Two merging strategies:

Weighted Sum

def hybrid_search(query: str, alpha: float = 0.5, top_k: int = 5):
    bm25_scores = get_bm25_scores(query)      # normalized 0-1
    vector_scores = get_vector_scores(query)    # normalized 0-1

    combined = {}
    for doc_id in set(bm25_scores) | set(vector_scores):
        combined[doc_id] = (
            alpha * vector_scores.get(doc_id, 0) +
            (1 - alpha) * bm25_scores.get(doc_id, 0)
        )

    return sorted(combined.items(), key=lambda x: -x[1])[:top_k]

Problem: alpha needs tuning per domain. 0.5 is rarely optimal.
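The weighted sum also assumes both score sets are already on a 0-1 scale, as the comments above note; raw BM25 scores are unbounded, so they need rescaling first. A minimal min-max sketch (the helper name is illustrative):

```python
def minmax_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: treat every doc as equally relevant
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

print(minmax_normalize({"a": 2.0, "b": 6.0, "c": 4.0}))  # → {'a': 0.0, 'b': 1.0, 'c': 0.5}
```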

Reciprocal Rank Fusion (RRF)

Merge by rank position, no alpha tuning needed:

def rrf_merge(ranked_lists: list[list], k: int = 60) -> list:
    """Merge multiple ranked lists using Reciprocal Rank Fusion."""
    scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda x: -x[1])

RRF is preferred because it doesn't require score normalization or alpha tuning. Works with any number of retrieval methods.
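A worked example shows why RRF behaves well (the function is repeated so the snippet runs standalone; the document IDs are made up): a document ranked second in both lists beats documents ranked first in only one.

```python
def rrf_merge(ranked_lists: list[list], k: int = 60) -> list:
    """Reciprocal Rank Fusion, as defined above."""
    scores = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda x: -x[1])

bm25_ranked   = ["exact-match", "shared", "noise-a"]   # keyword results
vector_ranked = ["semantic", "shared", "noise-b"]      # vector results

merged = rrf_merge([bm25_ranked, vector_ranked])
print(merged[0][0])  # "shared" wins: 1/61 + 1/61 beats 1/60 from a single list
```

Agreement between retrievers is strong evidence of relevance, and RRF rewards it automatically.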

LLM Reranking

After initial retrieval, use an LLM to rerank candidates by relevance:

import json

def llm_rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    prompt = f"""Rate each candidate's relevance to the query (0-10).
Query: {query}

Candidates:
{chr(10).join(f'{i+1}. {c[:200]}' for i, c in enumerate(candidates))}

Return JSON: [{{"index": 1, "score": 8}}, ...]"""

    response = llm.complete(prompt, model="haiku")  # cheap model suffices
    scores = json.loads(response)  # [{"index": ..., "score": ...}, ...]
    ranked = sorted(scores, key=lambda x: -x["score"])
    return [candidates[s["index"] - 1] for s in ranked[:top_k]]

Cost: ~$0.001/query with a small model. Pushes recall from 96.6% to ~100% on benchmarks.

Alternative: Cross-encoder reranking (no LLM needed):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

Cross-encoders are faster and cheaper than LLM reranking but less flexible.

Patterns

Pre-Fetch at Plan Start

Don't retrieve on every agent step. Batch retrieval at the start of each plan phase:

def start_plan_phase(phase: str, subtasks: list[str]):
    # Pre-fetch all likely-needed context
    queries = [phase] + subtasks
    context = {}
    for q in queries:
        context[q] = memory.search(q, top_k=3)

    # Inject aggregated context once
    return format_context(context)

This avoids retrieval latency on every LLM call and keeps the agent's flow uninterrupted.
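When pre-fetching is impractical (queries only become known mid-task), a per-task cache gives much of the same benefit: repeated steps stop re-querying. A sketch, assuming a `search_fn(query, top_k)` callable; the `CachedRetriever` class is illustrative:

```python
class CachedRetriever:
    """Wraps a search function; repeated queries within one task hit the cache."""

    def __init__(self, search_fn):
        self.search_fn = search_fn
        self.cache: dict[tuple[str, int], list[str]] = {}

    def search(self, query: str, top_k: int = 3) -> list[str]:
        key = (query, top_k)
        if key not in self.cache:
            self.cache[key] = self.search_fn(query, top_k)
        return self.cache[key]

    def clear(self):
        """Call at task boundaries so stale results don't leak across tasks."""
        self.cache.clear()

# Demo with a stub search function that counts backend calls.
calls = []
def fake_search(q, k):
    calls.append(q)
    return [f"doc-for-{q}"]

retriever = CachedRetriever(fake_search)
retriever.search("db preference")
retriever.search("db preference")  # served from cache, no second backend call
print(len(calls))  # → 1
```

Clearing at task boundaries matters: memory can change between tasks, and a stale cache would hide the update.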

Metadata-Filtered Retrieval

Narrow the search space before similarity computation:

results = collection.query(
    query_texts=["deployment configuration"],
    n_results=5,
    where={
        "$and": [
            {"wing": "project-api"},
            {"valid_to": None},
            {"room": {"$in": ["infrastructure", "deployment"]}}
        ]
    }
)

Metadata filtering happens before vector search in most databases - design your metadata schema carefully.
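The filter-then-search order can be mimicked in plain Python to make the semantics concrete. A sketch of the same `where` clause as a pre-filter, using made-up documents and the wing/room/valid_to fields from the example above:

```python
def metadata_filter(docs: list[dict], wing: str, rooms: set[str]) -> list[dict]:
    """Pre-filter candidates the way a vector DB applies `where` before similarity search."""
    return [
        d for d in docs
        if d["wing"] == wing
        and d["valid_to"] is None      # only current facts
        and d["room"] in rooms
    ]

docs = [
    {"id": 1, "wing": "project-api", "room": "deployment", "valid_to": None},
    {"id": 2, "wing": "project-api", "room": "design",     "valid_to": None},
    {"id": 3, "wing": "project-api", "room": "deployment", "valid_to": "2024-01-01"},
]

survivors = metadata_filter(docs, "project-api", {"infrastructure", "deployment"})
print([d["id"] for d in survivors])  # → [1]
```

Only the survivors are embedded and compared, which is why a well-designed schema cuts both latency and false positives.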

Cost Comparison

For 1000 queries/month against 10K documents:

| Method | Infrastructure | Per-Query | Monthly Total |
|---|---|---|---|
| Index navigation | ~$0 (files) | ~$0 | ~$0 |
| BM25 only | ~$5 (compute) | ~$0 | ~$5 |
| Vector only | ~$20 (vector DB) | ~$0.0005 | ~$20.50 |
| Hybrid + rerank | ~$20 (vector DB) | ~$0.001 | ~$21 |

The cost difference between approaches is small. Choose by recall quality, not price.

Gotchas

  • Vector search can miss obviously present text. Cosine similarity sometimes fails where keyword search succeeds trivially. Always combine with BM25 for critical applications. See embeddings for known similarity failures.
  • Retrieval latency compounds in agent loops. An agent that runs 20 steps with vector search on each step adds 2-4 seconds of retrieval latency. Pre-fetch at plan start or cache results within a task. Batch retrieval queries where possible.
  • Metadata filtering design is permanent. Changing the metadata schema requires re-indexing all documents. Plan your metadata fields (wing, room, type, valid_from, valid_to) before initial indexing. Adding new fields later means re-processing everything.

See Also