
Token Optimization for Agents

Advanced

Reducing token consumption in agent systems without degrading task performance. Agent costs scale with tokens (input + output) x number of calls x number of agents. A 75% per-call token reduction in a 5-agent system with 20 calls each cuts the total bill by 75% (4x fewer tokens billed), and that saving applies across all 100 calls.

Response Compression Techniques

Caveman Prompting

Force terse, keyword-dense responses by instructing the model to use minimal grammar:

System: Reply in compressed style. No articles, no filler words.
Skip "the", "a", "is", "are". Use arrows, abbreviations.
Max info per token. Like telegram.

User: What are the main causes of the 2008 financial crisis?

Normal response (180 tokens):
"The 2008 financial crisis was primarily caused by several
interconnected factors. First, there was a massive expansion of
subprime mortgage lending..."

Compressed response (45 tokens):
"2008 crisis causes: subprime mortgage expansion -> securitization
(MBS/CDO) -> credit rating failures -> overleveraged banks ->
Lehman collapse -> credit freeze -> global contagion"

Savings: 75% token reduction. Works best for factual/analytical tasks. Quality degrades on creative or nuanced tasks.

Structured Output Compression

Replace natural language agent responses with structured formats:

# Instead of: "I searched the database and found 3 results.
# The most relevant one is the user profile for John with ID 12345..."

# Use:
{"action": "search_db", "results": 3, "top": {"id": 12345, "type": "user"}}
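On the consuming side, compact outputs need robust parsing, since the model can drift back into prose. A minimal sketch (the fallback shape here is an assumption, not part of any API):

```python
import json

def parse_agent_output(raw: str) -> dict:
    """Parse a compact JSON agent response; wrap raw text on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Model returned prose instead of JSON: keep the text, don't crash.
        return {"action": "unparsed", "raw": raw}
```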

Scratchpad Compression

Reasoning traces often account for the bulk of an agent's output tokens. Compress intermediate steps:

# Verbose scratchpad (typical):
"""
Thought: I need to find the error in the code. Let me look at the
stack trace first. The error message says "IndexError: list index
out of range" which means we're trying to access an element at an
index that doesn't exist. Let me check the function where this occurs.
Action: read_file("main.py")
Observation: [200 lines of code]
Thought: I can see the issue. On line 45, we access arr[i] but the
loop goes from 0 to len(arr) inclusive, which means when i equals
len(arr), we get an out of bounds error...
"""

# Compressed scratchpad:
"""
T: IndexError line 45: loop 0..len(arr) inclusive, off-by-one
A: read_file("main.py")
O: [relevant lines only: 43-47]
T: Fix: range(len(arr)) not range(len(arr)+1)
"""

Implementation: set a word budget in the system prompt:

Reasoning budget: max 30 words per thought. Use abbreviations.
T=thought, A=action, O=observation (first 5 relevant lines only).
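The prompt-side budget is a request, not a guarantee, so it can help to pair it with a client-side guard. A sketch (`enforce_word_budget` is a hypothetical helper, not a library function):

```python
def enforce_word_budget(thought: str, max_words: int = 30) -> str:
    """Hard-truncate a reasoning step that overruns the word budget."""
    words = thought.split()
    if len(words) <= max_words:
        return thought
    return " ".join(words[:max_words]) + " ..."
```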

Cost-Aware Architecture Patterns

Model Cascading

Route cheap tasks to cheap models, expensive tasks to expensive models:

def route_to_model(task: dict) -> str:
    """Select model based on task complexity."""
    complexity = estimate_complexity(task)
    if complexity < 0.3:
        return "haiku"          # ~$0.25/M tokens
    elif complexity < 0.7:
        return "sonnet"         # ~$3/M tokens
    else:
        return "opus"           # ~$15/M tokens

def estimate_complexity(task: dict) -> float:
    """Heuristic complexity scoring."""
    score = 0.0
    if task.get("requires_reasoning"):
        score += 0.3
    if task.get("multi_step"):
        score += 0.3
    if task.get("code_generation"):
        score += 0.2
    if len(task.get("context", "")) > 10000:
        score += 0.2
    return min(score, 1.0)
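A related pattern is escalation on failure: try the cheapest model first and retry on a stronger one only when the output fails validation. A minimal sketch with the model clients and the validator injected as callables (all names are illustrative):

```python
from typing import Callable, Sequence

def cascade(task: str,
            callers: Sequence[Callable[[str], str]],
            validate: Callable[[str], bool]) -> str:
    """Run models cheapest-first; escalate only when validation fails."""
    result = ""
    for call in callers:
        result = call(task)
        if validate(result):
            return result
    # Every tier failed validation: return the strongest model's attempt.
    return result
```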

Subagent Token Budget

Enforce per-subagent token limits:

class TokenBudgetAgent:
    def __init__(self, max_input: int = 4000, max_output: int = 1000):
        self.max_input = max_input
        self.max_output = max_output

    def run(self, task: str, context: str) -> str:
        # Truncate context to budget
        context = truncate_to_tokens(context, self.max_input - count_tokens(task))

        response = llm.generate(
            messages=[{"role": "user", "content": f"{task}\n\nContext:\n{context}"}],
            max_tokens=self.max_output,
        )
        return response
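`count_tokens` and `truncate_to_tokens` above are assumed helpers. A rough sketch using the common ~4-characters-per-token heuristic for English text (swap in a real tokenizer such as tiktoken for accurate counts):

```python
def count_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Truncate text to roughly max_tokens using the same heuristic."""
    if count_tokens(text) <= max_tokens:
        return text
    return text[: max_tokens * 4]
```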

Caching and Deduplication

import hashlib
import json
from functools import lru_cache

# Cache identical tool call results.
# lru_cache requires hashable arguments, so params are passed as a
# canonical JSON string (hashing is one-way, so a hash can't be reversed).
@lru_cache(maxsize=256)
def cached_tool_call(tool_name: str, params_json: str):
    return execute_tool(tool_name, json.loads(params_json))

# Call with: cached_tool_call("search_db", json.dumps(params, sort_keys=True))

# Deduplicate context across agent calls
def deduplicate_context(messages: list[dict]) -> list[dict]:
    """Remove duplicate tool observations."""
    seen_observations = set()
    deduped = []
    for msg in messages:
        if msg["role"] == "tool":
            h = hashlib.md5(msg["content"].encode()).hexdigest()
            if h in seen_observations:
                continue
            seen_observations.add(h)
        deduped.append(msg)
    return deduped
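Note that `lru_cache` never expires entries, which risks serving stale tool results. A TTL cache sketch (a hypothetical wrapper, not a library API):

```python
import time

class TTLCache:
    """Tool-result cache whose entries expire after ttl_seconds."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._store[key]  # expired: drop the entry and miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.time())

    def invalidate(self):
        """Clear everything after a state-changing action (write/delete/update)."""
        self._store.clear()
```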

Prompt Caching

Modern APIs cache repeated prompt prefixes. Structure prompts so the stable portion comes first:

messages = [
    # STABLE PREFIX (cached after first call - 90% cheaper on subsequent)
    {"role": "system", "content": long_system_prompt},      # ~2000 tokens
    {"role": "user", "content": reference_documents},        # ~5000 tokens

    # VARIABLE SUFFIX (changes each call)
    {"role": "user", "content": current_task},               # ~200 tokens
]
# First call: 7200 tokens at full price
# Subsequent calls: 200 tokens at full price + 7000 at 10% price
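The arithmetic above generalizes. A sketch assuming a 90% discount on cached prefix tokens (exact discounts and cache lifetimes vary by provider):

```python
def prompt_cache_cost(prefix_tokens: int, suffix_tokens: int,
                      n_calls: int, price_per_token: float,
                      cached_discount: float = 0.1) -> float:
    """Total input cost with prefix caching: the first call pays full
    price for everything; later calls pay full price only for the
    suffix plus the discounted rate for the cached prefix."""
    first_call = (prefix_tokens + suffix_tokens) * price_per_token
    later_calls = (n_calls - 1) * (
        suffix_tokens + prefix_tokens * cached_discount
    ) * price_per_token
    return first_call + later_calls
```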

Measuring Token Efficiency

def token_efficiency_report(agent_runs: list[dict]) -> dict:
    """Calculate token efficiency metrics across runs."""
    total_tokens = sum(r["tokens_used"] for r in agent_runs)
    successful = [r for r in agent_runs if r["success"]]
    failed = [r for r in agent_runs if not r["success"]]

    return {
        "total_tokens": total_tokens,
        "tokens_per_success": total_tokens / max(len(successful), 1),
        "tokens_per_task": total_tokens / len(agent_runs),
        "waste_ratio": sum(r["tokens_used"] for r in failed) / total_tokens,
        "avg_steps": sum(r["steps"] for r in agent_runs) / len(agent_runs),
        "cost_usd": total_tokens * 0.000003,  # adjust per model
    }

Track tokens_per_success over time - it should decrease as you optimize. A high waste_ratio means the agent burns tokens on doomed attempts - improve early failure detection.

Gotchas

  • Aggressive compression kills nuance: caveman prompting works for factual extraction but causes the model to miss subtleties in analysis tasks. Test quality metrics before deploying compressed prompts in production - measure pass rate, not just token count
  • Scratchpad truncation causes reasoning failures: if you cut the reasoning trace too aggressively, the model loses track of its chain-of-thought. Start with a 50% compression target and measure task success rate. Never compress below the point where success rate drops
  • Caching stale results: tool call caching can serve outdated data if the underlying state changed between calls. Add TTL to cached results and invalidate on state-changing actions (write, delete, update)
  • Model cascading routing errors are expensive: sending a hard task to a cheap model wastes those tokens entirely when it fails, then you pay again for the expensive model. When in doubt, route up not down

See Also