Skip to content

Agent Evaluation and Benchmarks

Evaluating agents is fundamentally harder than evaluating models. Agents have stochastic behavior, multi-step execution, tool interactions, and real-world side effects. A model that scores 90% on a benchmark may produce a 50% success rate agent.

Evaluation Dimensions

Task Completion

Binary: did the agent accomplish the goal?

def evaluate_task_completion(agent, test_cases):
    results = []
    for case in test_cases:
        output = agent.run(case["input"])
        success = case["validator"](output)
        results.append({
            "task": case["name"],
            "success": success,
            "steps": output.step_count,
            "cost": output.total_cost,
            "time": output.elapsed_seconds,
        })

    pass_rate = sum(r["success"] for r in results) / len(results)
    avg_cost = sum(r["cost"] for r in results) / len(results)
    return {"pass_rate": pass_rate, "avg_cost": avg_cost, "results": results}

Efficiency Metrics

  • Steps to completion: fewer is better (less cost, less room for error)
  • Tool calls: unnecessary tool calls waste tokens and time
  • Token usage: total input + output tokens consumed
  • Wall clock time: end-to-end latency
  • Cost: total API spend per task

Quality Metrics

  • Correctness: output matches expected result
  • Completeness: all parts of the task addressed
  • Safety: no harmful actions taken
  • Robustness: handles edge cases and errors gracefully

Benchmark Suites

SWE-Bench

Real GitHub issues resolved by agents. Gold standard for coding agents.

  • SWE-Bench Lite: 300 issues, more tractable
  • SWE-Bench Verified: human-verified subset
  • Metric: % of issues resolved (patch applies + tests pass)
  • Current SOTA: ~50% on Verified (as of early 2026)

GAIA

General AI Assistants benchmark. Multi-step reasoning with tool use.

  • 3 difficulty levels
  • Requires web search, file reading, code execution
  • Tests compositional reasoning across tools

HumanEval / MBPP

Code generation benchmarks. Simpler than SWE-Bench (function-level, not repo-level).

WebArena / VisualWebArena

Web navigation benchmarks. Agent completes tasks in real websites.

AgentBench

Multi-environment benchmark: OS, database, web, game environments.

Building Custom Evals

Deterministic Validators

# Exact match
def validate_exact(output, expected):
    return output.strip() == expected.strip()

# Contains required elements
def validate_contains(output, required_elements):
    return all(elem in output for elem in required_elements)

# Code execution test
def validate_code(output, test_cases):
    exec_env = {}
    try:
        exec(output, exec_env)
        for inp, expected_out in test_cases:
            assert exec_env["solution"](inp) == expected_out
        return True
    except (AssertionError, Exception):
        return False

LLM-as-Judge

When deterministic validation is impossible:

def llm_judge(task, agent_output, rubric):
    prompt = f"""
    Task: {task}
    Agent output: {agent_output}

    Evaluate on this rubric (1-5 scale for each):
    {rubric}

    Return JSON: {{"scores": {{"criterion": score}}, "explanation": "..."}}
    """
    judgment = judge_llm(prompt)
    return json.loads(judgment)

# Calibration: always include anchor examples
rubric = """
1. Correctness (1-5): Does the answer match the expected outcome?
   1 = completely wrong, 3 = partially correct, 5 = fully correct
2. Efficiency (1-5): Did the agent use minimal steps?
   1 = excessive tool calls, 3 = reasonable, 5 = optimal path
3. Safety (1-5): Were any dangerous actions taken?
   1 = destructive action, 3 = minor risk, 5 = completely safe
"""

Trajectory Evaluation

Evaluate the process, not just the outcome:

def evaluate_trajectory(trajectory):
    scores = {
        "redundant_calls": count_redundant_tool_calls(trajectory),
        "error_recovery": count_successful_recoveries(trajectory),
        "planning_quality": judge_plan_coherence(trajectory),
        "tool_selection_accuracy": correct_tools / total_tools,
    }
    return scores

def count_redundant_tool_calls(trajectory):
    """Detect repeated identical tool calls."""
    seen = set()
    redundant = 0
    for step in trajectory:
        key = (step.tool, frozenset(step.params.items()))
        if key in seen:
            redundant += 1
        seen.add(key)
    return redundant

Statistical Rigor

import numpy as np
from scipy import stats

# Confidence intervals for pass rate
def pass_rate_ci(successes, total, confidence=0.95):
    p = successes / total
    z = stats.norm.ppf((1 + confidence) / 2)
    margin = z * np.sqrt(p * (1-p) / total)
    return (p - margin, p + margin)

# Comparing two agents: paired bootstrap test
def compare_agents(scores_a, scores_b, n_bootstrap=10000):
    diffs = np.array(scores_a) - np.array(scores_b)
    boot_means = [np.mean(np.random.choice(diffs, len(diffs))) for _ in range(n_bootstrap)]
    p_value = np.mean(np.array(boot_means) <= 0)
    return {"mean_diff": np.mean(diffs), "p_value": p_value}

Minimum sample sizes: 100 test cases for rough estimates, 300+ for reliable comparisons between agents. Single-digit differences on < 50 cases are noise.

Regression Testing

# Track performance over agent versions
def regression_check(current_results, baseline_results, threshold=0.05):
    current_rate = sum(r["success"] for r in current_results) / len(current_results)
    baseline_rate = sum(r["success"] for r in baseline_results) / len(baseline_results)

    regression = baseline_rate - current_rate
    if regression > threshold:
        failing_cases = [
            r["task"] for r, b in zip(current_results, baseline_results)
            if b["success"] and not r["success"]
        ]
        return {"regression": True, "drop": regression, "failing_cases": failing_cases}
    return {"regression": False}

Gotchas

  • LLM-as-judge is biased: judges favor verbose outputs, outputs matching their own style, and outputs presented first in A/B comparisons. Calibrate with human annotations, randomize presentation order, and use multiple judge models for critical evaluations
  • Benchmark contamination: if agent training data includes benchmark solutions, scores are inflated. Use held-out test cases, time-gated benchmarks (only issues after training cutoff), or create custom evals from your own production data
  • Non-determinism makes comparison hard: same agent on same task can succeed or fail depending on LLM sampling. Run each test case 3-5 times and report mean + variance. A single pass/fail result is unreliable for agent comparison

See Also