Context Window Management

Advanced

Strategies for managing what enters the LLM's context window and when. The context window is finite working memory - everything the agent can reason about at once. Poor management is the primary cause of agent degradation on long tasks.

Key Facts

  • Context windows range from 4K to 2M tokens, but attention quality degrades in the middle ("lost in the middle" effect)
  • Cost scales linearly with context size - 128K context at $3/M input = $0.38 per call
  • Compaction (summarization) is irreversible - details are permanently lost once summarized
  • Layered loading (always-on core + on-demand details) keeps base cost at ~170 tokens while enabling access to full memory
  • Context anxiety: agents start rushing/shortcutting when they perceive context is filling up, even if there's room left
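
The linear cost claim above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `input_cost_usd` is illustrative, and the $3 per million input tokens is the example price from the list above, not a quote for any particular provider.

```python
def input_cost_usd(context_tokens: int, usd_per_million: float = 3.0) -> float:
    """Input cost of a single call: token count times the per-token price."""
    return context_tokens * usd_per_million / 1_000_000


full_window = input_cost_usd(128_000)  # 0.384, rounded to $0.38 above
always_on = input_cost_usd(170)        # cost of the ~170-token always-on core
```

Because the relationship is linear, halving the context halves the input cost of every call.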

Layered Loading Pattern

Divide memory into layers by criticality. Lower layers are always present; higher layers load on demand.

L0 - Identity (~50 tokens)     ← ALWAYS in context
  "You are an assistant for Project X. User is Alice, senior engineer."

L1 - Critical Facts (~120 tokens) ← ALWAYS in context
  "Tech stack: Python 3.12, PostgreSQL 16, deployed on K8s."
  "User prefers: concise answers, code examples, no emojis."

L2 - Topic Memory (variable)    ← Loaded when topic detected
  Retrieved from vector store based on current query.
  Budget: ~2000 tokens per retrieval.

L3 - Full History (variable)    ← Loaded on explicit request
  Complete conversation logs, raw documents.
  Only when agent specifically needs to review past interactions.

Cost comparison: L0+L1 always loaded = ~170 tokens/call = ~$10/year. Full summarization approach = ~$500/year. The layered approach is 50x cheaper.
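
One way the layers might be assembled per call is sketched below. It is an illustration under stated assumptions, not a reference implementation: `build_context`, `estimate_tokens`, and the `retrieve` callback are hypothetical names, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from typing import Callable

L0_IDENTITY = "You are an assistant for Project X. User is Alice, senior engineer."
L1_FACTS = (
    "Tech stack: Python 3.12, PostgreSQL 16, deployed on K8s.\n"
    "User prefers: concise answers, code examples, no emojis."
)


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def build_context(query: str,
                  retrieve: Callable[[str], list[str]],
                  l2_budget: int = 2000) -> str:
    """Always include L0 and L1, then add L2 snippets until the budget is hit."""
    parts = [L0_IDENTITY, L1_FACTS]
    used = 0
    for snippet in retrieve(query):  # hypothetical vector-store lookup
        cost = estimate_tokens(snippet)
        if used + cost > l2_budget:
            break  # stop at the first snippet that would exceed the L2 budget
        parts.append(snippet)
        used += cost
    return "\n\n".join(parts)
```

L3 is deliberately absent from this function: full history would be loaded through a separate, explicit request rather than on every call.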

Token Budget Allocation

class TokenBudget:
    """Split a context window into fixed percentage shares per category."""

    def __init__(self, max_tokens: int = 128_000):
        self.max_tokens = max_tokens  # total window size the shares apply to
        self.allocations = {          # shares sum to 1.0
            "system_prompt": 0.08,    # 8% - instructions, tool defs
            "identity_l0_l1": 0.02,   # 2% - always-on memory layers
            "state": 0.10,            # 10% - current plan, progress
            "retrieved_memory": 0.15, # 15% - L2 on-demand context
            "conversation": 0.25,     # 25% - recent exchanges
            "tool_results": 0.20,     # 20% - outputs from tools
            "generation": 0.20,       # 20% - reserved for response
        }

    def available(self, category: str) -> int:
        """Return the token allowance for one category."""
        return int(self.max_tokens * self.allocations[category])

Rule of thumb: Reserve 20% for generation. If tool results are large, summarize them before inserting into context.
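
When a tool result exceeds its share, the cheapest fallback is to clip it before it enters the context. The sketch below keeps the head and tail and drops the middle. It is a stand-in for real summarization, not a replacement for it; `clip_to_budget` is a hypothetical helper and the 4-characters-per-token ratio is an assumption.

```python
def clip_to_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Keep the head and tail of an oversized tool result; drop the middle."""
    max_chars = max_tokens * chars_per_token  # assumption: ~4 chars per token
    if len(text) <= max_chars:
        return text
    marker = "\n[... truncated ...]\n"
    keep = max(0, max_chars - len(marker))
    head = keep // 2
    tail = keep - head
    # text[-0:] would return the whole string, so guard the zero-tail case.
    return text[:head] + marker + (text[-tail:] if tail else "")


# Same 20% share as the "tool_results" allocation above, in integer arithmetic.
tool_budget = 128_000 * 20 // 100
```

Head-and-tail clipping suits logs and command output, where the start shows what ran and the end shows how it finished. For documents whose key content sits in the middle, an LLM summary is likely the better choice.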

Compaction Strategies

Summarize-and-Replace

Replace old conversation history with a compressed summary:

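def compact(messages: list, keep_recent: int = 6) -> list:
    # `llm` and `format_messages` are assumed to come from the surrounding agent code.
    system = messages[0]
    # Index where the recent tail starts; never before index 1 (the system message).
    # Computing it explicitly avoids messages[-0:], which would return the whole list.
    split = max(1, len(messages) - max(0, keep_recent))
    recent = messages[split:]
    old = messages[1:split]

    if not old:
        return messages  # nothing old enough to summarize

    summary = llm.complete(f"""Summarize preserving:
    1. Key decisions made
    2. Important findings
    3. Failed approaches (DO NOT retry these)
    4. Current state of task
    History: {format_messages(old)}""")

    return [system,
            {"role": "system", "content": f"[Summary of earlier conversation]\n{summary}"},
            *recent]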

State File + Context Reset

For long tasks, prefer full context reset over compaction. Write state to files, start fresh:

# Before reset
state = {
    "goal": "Migrate API from REST to GraphQL",
    "completed": ["schema design", "resolver stubs", "auth middleware"],
    "current": "pagination implementation",
    "findings": ["cursor pagination is 3x faster than offset for our dataset"],
    "failed": ["relay-style connections - too complex for our use case"],
    "blockers": []
}
save_json(".agent/state.json", state)

# After reset - agent reads state file, continues from where it left off
# No information loss from summarization

Prefer a context reset over compaction for tasks longer than ~30 minutes. Compaction preserves continuity, but each pass loses more information and the losses accumulate. A fresh context with explicit state is cleaner.
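
The `save_json` and `load_json` helpers used in this section are not defined in the text. A minimal sketch of what they might look like follows, assuming the state is a plain JSON-serializable dict. The atomic write is a design choice of this sketch, not something the original specifies.

```python
import json
import os
import tempfile
from pathlib import Path


def save_json(path: str, data: dict) -> None:
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp, target)


def load_json(path: str) -> dict:
    """Read the state file back after a context reset."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Writing through a temporary file matters here because the state file is the only surviving record after a reset. A half-written file would leave the next session with nothing to resume from.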

Re-injection After Compaction

If compaction is unavoidable, immediately re-inject critical state:

def post_compaction_message(state_file: str) -> str:
    state = load_json(state_file)
    return f"""[Context was compacted. Current state:]

## Goal
{state['goal']}

## Completed Steps
{chr(10).join(f'- {s}' for s in state['completed'])}

## Current Step
{state['current']}

## Failed Approaches (DO NOT retry)
{chr(10).join(f'- {f}' for f in state['failed'])}

[Continue from current step. State above is authoritative.]"""

Auto-Save Triggers

Save memory state automatically at key points:

  • Every N messages (e.g., every 15 messages) - periodic checkpoint
  • Before compaction - capture full state before lossy operation
  • Before context reset - handoff artifact for next session
  • On topic change - close out current topic's findings
  • On error/failure - record what went wrong and why
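
The triggers above could be folded into a single policy object that the agent loop consults on every event. This is a sketch only: `CheckpointPolicy` and the event names are hypothetical, and the default of 15 messages is taken from the example in the list.

```python
from dataclasses import dataclass

# Events that always force a save, mirroring the list above.
ALWAYS_SAVE = {"pre_compaction", "pre_reset", "topic_change", "error"}


@dataclass
class CheckpointPolicy:
    every_n_messages: int = 15
    messages_since_save: int = 0

    def should_save(self, event: str) -> bool:
        """Return True when the given event should trigger a state save."""
        if event in ALWAYS_SAVE:
            self.messages_since_save = 0
            return True
        if event == "message":
            self.messages_since_save += 1
            if self.messages_since_save >= self.every_n_messages:
                self.messages_since_save = 0
                return True
        return False  # unknown events never trigger a save
```

A forced save also resets the message counter, so a periodic checkpoint does not fire immediately after one triggered by an error or a topic change.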

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Stuffing everything in context | "Lost in the middle" - model ignores middle | Layered loading, JIT retrieval |
| No external state | Compaction kills critical details | File-based state management |
| Reactive retrieval every call | Latency compounds | Pre-fetch at plan phase start |
| Ignoring token costs | $3.80 per 10 calls at 128K | Shorter context when possible |
| Trusting compaction output | Summaries miss nuance | Always verify critical facts post-compaction |

Gotchas

  • "Lost in the middle" is measurable and consistent. LLMs attend strongly to the beginning and end of context but poorly to the middle. Place critical information (current task, plan, constraints) at the start, recent observations at the end. Never bury important state in the middle of a long history
  • Compaction loses information irreversibly. Before compacting, save full state to files. After compaction, re-inject the structured state so the agent knows where it stands. Test by asking the agent questions about pre-compaction work - if it can't answer, your re-injection is incomplete
  • Auto-compact can loop. If context refills quickly after compaction (e.g., large tool outputs), the agent can enter a compact-refill-compact cycle. Set a circuit breaker: after 3 compactions without progress, stop and report
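
The circuit breaker for the compact-refill-compact cycle can be very small. Below is a minimal sketch, assuming the agent loop can tell when it has made progress. `CompactionBreaker` and its method names are hypothetical.

```python
class CompactionBreaker:
    """Deny further compactions after too many in a row without progress."""

    def __init__(self, max_without_progress: int = 3):
        self.max_without_progress = max_without_progress
        self.count = 0  # compactions since the last recorded progress

    def record_progress(self) -> None:
        """Call when the agent completes a step; resets the counter."""
        self.count = 0

    def allow_compaction(self) -> bool:
        """Return False once the limit is reached; the caller should stop and report."""
        if self.count >= self.max_without_progress:
            return False
        self.count += 1
        return True
```

With the default limit, three consecutive compactions are allowed and the fourth is refused until `record_progress()` is called.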

KV Cache Compression (2026)

Inference-level techniques that reduce the memory footprint of the KV cache backing the context window. They are orthogonal to prompt-level compression: they operate inside the attention mechanism, not on the prompt text.

TriAttention (April 2026) - SOTA

Exploits the discovery that pre-RoPE Q/K vectors concentrate around fixed centers across attention heads. Uses trigonometric series to estimate key importance without needing recent queries.

How it works:
  1. Trigonometric Series Scoring (S_trig):
     Estimates key importance via Q/K centers + positional distance
     Operates in pre-RoPE space where vectors are stable

  2. Norm-Based Scoring (S_norm):
     Complementary signal for low-concentration heads

  3. Adaptive Weighting:
     Auto-balances using Mean Resultant Length (R) metric

Results:
  10.7x KV memory reduction matching full attention accuracy (AIME25 40.8%)
  2.5x throughput at equivalent accuracy
  6.3x speedup on MATH 500 (1405 vs 223 tokens/sec)
  RULER retrieval: 66.1 vs SnapKV 55.6

Deployment: vLLM plugin (auto-discovery, zero code changes)
Validated on: Qwen3-8B, DeepSeek-R1-Distill, GPT-OSS-20B, GLM-4.7-Flash

Previous KV compression methods (H2O, SnapKV, R-KV) use "limited observation windows" - only recent queries maintain representative orientations. TriAttention operates in pre-RoPE space where vectors are inherently stable, providing intrinsic importance signals.
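
The Mean Resultant Length listed under Adaptive Weighting is a standard directional-statistics measure of how tightly a set of directions clusters: normalize each vector to unit length, average them, and take the norm of the average. The sketch below illustrates only that general statistic. How TriAttention computes or applies R per attention head is not specified here, and the function name is illustrative.

```python
import math


def mean_resultant_length(vectors: list[list[float]]) -> float:
    """R = norm of the mean of unit-normalized vectors; ranges from 0 to 1."""
    dim = len(vectors[0])
    total = [0.0] * dim
    used = 0
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            continue  # a zero vector has no direction, so it is skipped entirely
        for i, x in enumerate(v):
            total[i] += x / norm
        used += 1
    if used == 0:
        return 0.0
    return math.sqrt(sum(t * t for t in total)) / used
```

R near 1 means the vectors point in nearly the same direction (high concentration). R near 0 means the directions are spread out and cancel each other, which corresponds to the low-concentration case where a complementary signal would be needed.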

TurboQuant (ICLR 2026)

Production-ready KV cache quantization integrated into vLLM.

Method: Hadamard rotation + Lloyd-Max scalar quantization + outlier-aware allocation
Effect: bf16 -> packed 4-bit uint8 (4x memory reduction)
Usage: --kv-cache-dtype option in vLLM
Quality: minimal degradation on standard benchmarks

Other Compression Approaches

| Method | Type | Compression | Use Case |
|---|---|---|---|
| TriAttention | KV eviction | 10.7x | Inference, reasoning tasks |
| TurboQuant | KV quantization | 4x | Production serving (vLLM) |
| NVIDIA KVPress | KV compression toolkit | Variable | Benchmarking strategies |
| CompLLM | Soft context compression | 2x | Long context Q&A, 4x TTFT speedup |
| LoPace | Lossless prompt storage | 4.89x avg | Prompt caching, not inference |

Practical Deployment

Strategy for maximum context efficiency:
  1. TurboQuant (4-bit KV cache) - always-on, minimal quality impact
  2. TriAttention (selective eviction) - for reasoning/long context
  3. Both combined: ~40x effective KV compression (10.7 x 4 = 42.8, assuming the two reductions compose multiplicatively)

Hardware requirements:
  TriAttention: tested on A100 80GB, bfloat16
  TurboQuant: any vLLM-supported GPU
  Combined: enables 128K context on consumer GPUs (24GB)
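
To see what these ratios mean in memory terms, the KV cache size can be estimated from model shape. The sketch below uses the usual formula (keys plus values, per layer, per KV head, per position). The model dimensions in the example are illustrative assumptions, not the specification of any model named above, and the estimate ignores quantization metadata such as scales.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bits_per_value: int = 16, keep_fraction: float = 1.0) -> float:
    """Rough KV cache size for one sequence.

    keep_fraction models eviction: 1 / 10.7 keeps about 9.3% of the entries.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * keep_fraction * bits_per_value / 8


# Hypothetical 8B-class shape: 36 layers, 8 KV heads, head_dim 128, 128K context.
shape = dict(seq_len=131_072, n_layers=36, n_kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(**shape)                     # bf16, no eviction: 18 GiB
quantized = kv_cache_bytes(**shape, bits_per_value=4)  # 4-bit: a quarter of that
combined = kv_cache_bytes(**shape, bits_per_value=4, keep_fraction=1 / 10.7)
```

Under these assumptions the unquantized cache alone would take 18 GiB of a 24GB card before model weights are counted, which is why compression decides whether a 128K context fits at all.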

See Also