Autonomous Agent Evolution¶

Q: File locking on shared writes

multiple agents writing to .shared/ simultaneously can corrupt JSON files. Use atomic writes (write to temp file, then rename) or per-agent subdirectories with periodic merge

Q: Shared skills can propagate bad patterns

if one agent writes a flawed skill to .shared/skills/, others will adopt it. Add a minimum score threshold before promoting observations to skills

★★★★★ Intermediate

Replacing fixed evolutionary search (agents as stateless workers) with long-lived autonomous agents that control the entire search process: what to retrieve, when to evaluate, what to retain. Key innovation: workspace isolation + shared knowledge layer + periodic reflection.

Core Architecture¶

Isolated Workspaces¶

Each agent operates in a separate workspace (git worktree, container, or directory) to prevent interference. Agents can run different experiments in parallel without merge conflicts.

agent-0/          # git worktree for agent 0
agent-1/          # git worktree for agent 1
agent-2/          # git worktree for agent 2
.shared/          # shared knowledge (symlinked into each workspace)
  attempts/       # historical evaluations indexed by commit hash
  notes/          # markdown observations (hierarchical)
  skills/         # reusable procedures and scripts

# Setup per-agent worktrees from a base repo
git worktree add agent-0 -b agent-0-branch
git worktree add agent-1 -b agent-1-branch
# Symlink shared knowledge into each
ln -s $(pwd)/.shared agent-0/.shared
ln -s $(pwd)/.shared agent-1/.shared

Why worktrees over branches: agents need simultaneous filesystem access. Branch switching would serialize work. Worktrees give each agent a full working copy while sharing the git object store.

Shared Knowledge Layer¶

Three artifact types, all readable by every agent:

Artifact	Format	Purpose
`attempts/`	`{commit-hash}.json` with score + metadata	Prevent re-evaluation of identical solutions
`notes/`	Hierarchical markdown files	Observations, patterns, failed approaches
`skills/`	Executable scripts + markdown docs	Reusable procedures discovered during search

// .shared/attempts/a3f2b1c.json
{
  "commit": "a3f2b1c",
  "agent": 2,
  "score": 0.3809,
  "approach": "greedy interval optimization with symmetry breaking",
  "timestamp": "2026-04-01T14:23:00Z",
  "parent_commit": "b7e4d2a",
  "delta_score": 0.0003
}

<!-- .shared/notes/optimization/symmetry-breaking.md -->
# Symmetry Breaking in Interval Placement

Forcing the first interval to start at 0 eliminates ~50% of the search
space without losing optimal solutions. Confirmed by agents 0 and 2
across 8 independent evaluations.

Related attempts: a3f2b1c, d4e5f6g

Heartbeat Mechanism¶

Three reflection types at different frequencies prevent tunnel vision:

Per-iteration reflection (every evaluation):

After eval #{n} with score {s}:
1. What did this change accomplish?
2. Was the score change expected?
3. Write 1-2 line observation to .shared/notes/

Periodic consolidation (every ~10 evaluations):

1. Review own progress over last 10 evals
2. Browse other agents' notes in .shared/notes/
3. Organize scattered observations into structured notes
4. Extract reusable patterns into .shared/skills/
5. Identify promising directions from other agents' work

Stagnation redirection (5 consecutive non-improving evaluations):

1. Forced reassessment: "Current approach is not working"
2. Read all recent notes from ALL agents
3. Identify unexplored directions
4. Pivot to fundamentally different approach
5. Log the pivot reason in notes

Without heartbeats, agents fixate on local optima and stop sharing knowledge. The consolidation step is critical - it forces cross-pollination between agents.

Implementation Patterns¶

The simplest approach for Claude Code / coding agent setups:

import json
import glob
from pathlib import Path

SHARED = Path(".shared")

def log_attempt(commit: str, score: float, approach: str, agent_id: int):
    """Log evaluation result to shared knowledge."""
    entry = {
        "commit": commit,
        "agent": agent_id,
        "score": score,
        "approach": approach,
    }
    (SHARED / "attempts" / f"{commit}.json").write_text(json.dumps(entry, indent=2))

def get_best_score() -> float:
    """Read current best from shared attempts."""
    best = 0.0
    for f in glob.glob(str(SHARED / "attempts" / "*.json")):
        data = json.loads(Path(f).read_text())
        best = max(best, data["score"])
    return best

def check_stagnation(agent_id: int, window: int = 5) -> bool:
    """Detect if this agent has stagnated."""
    my_attempts = sorted(
        [json.loads(Path(f).read_text())
         for f in glob.glob(str(SHARED / "attempts" / "*.json"))
         if json.loads(Path(f).read_text())["agent"] == agent_id],
        key=lambda x: x["timestamp"],
    )
    if len(my_attempts) < window:
        return False
    recent = my_attempts[-window:]
    return all(a.get("delta_score", 0) <= 0 for a in recent)

Evaluation Deduplication¶

Avoid re-running expensive evaluations on identical solutions:

import hashlib

def solution_hash(code: str) -> str:
    """Content-based hash ignoring whitespace/comments."""
    # Strip comments and normalize whitespace
    lines = [l.strip() for l in code.splitlines()
             if l.strip() and not l.strip().startswith("#")]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()[:12]

def already_evaluated(code: str) -> bool:
    h = solution_hash(code)
    return (SHARED / "attempts" / f"{h}.json").exists()

CORAL Framework¶

Reference implementation: github.com/Human-Agent-Society/Coral (MIT, 429 stars). Multi-agent research infrastructure by MIT/NUS/Stanford/Meta.

Setup and Launch¶

git clone https://github.com/Human-Agent-Society/CORAL.git && cd CORAL
uv sync --extra ui    # include web dashboard
uv run coral start -c examples/kernel_builder/task.yaml

Task Definition (task.yaml)¶

task_name: kernel_builder
num_agents: 4
runtime: claude-code   # or opencode, codex
grader: examples/kernel_builder/grader.py
max_iterations: 100
heartbeat_interval: 10  # consolidation every N evals
stagnation_threshold: 5

Custom Grader¶

from coral.grading import TaskGrader, ScoreBundle

class KernelGrader(TaskGrader):
    def evaluate(self, workspace: str) -> ScoreBundle:
        # Run benchmark, return structured scores
        cycles = run_kernel_benchmark(workspace)
        return ScoreBundle(
            primary=1.0 / cycles,          # lower cycles = higher score
            metrics={"cycles": cycles, "correctness": verify(workspace)},
        )

Supported Runtimes¶

Runtime	Command	Notes
Claude Code	`claude`	Default, most tested
OpenCode	`opencode`	Open-source terminal agent
Codex	`codex`	OpenAI coding agent

All require pre-installation. Runtime selection per task.yaml.

Evaluation Flow¶

Agents call uv run coral eval -m "description" which atomically: stages changes, commits, runs grader, records attempt to .coral/public/attempts/, updates shared knowledge.

Additional Features¶

Web dashboard on port 8420 (uv run coral ui) - real-time agent monitoring
LiteLLM gateway - custom model routing for non-default providers
Docker session mode - containerized agent isolation
Warm-start literature review - agents review prior art before optimization
Post-commit hooks - automatic evaluation triggers

Benchmarks¶

Task	CORAL (4 agents)	AlphaEvolve	Speedup
Erdos Minimum Overlap	99% of optimal, 34 min	99% of optimal, 5.2h	~9x
Anthropic Kernel	1103 cycles	1363 cycles	19% better

Comparison with Linear Approaches¶

Aspect	Linear (autoresearch)	Parallel Evolution
Agents	1	3-8
Search strategy	Sequential keep/discard	Parallel diverse exploration
Knowledge sharing	Git history only	Explicit shared knowledge layer
Stagnation handling	Manual	Automatic redirection
Reflection	Optional	Built-in heartbeat
Improvement rate	~9.5% (per eval)	~36.8% (per eval)
Total evaluations needed	84 (for same quality)	19
Cost per run	~$0.10/cycle	~$0.40/cycle (4 agents)
Effective cost/improvement	Higher	Lower (3-4x)

The parallel approach reaches better solutions with fewer total evaluations because agents explore different directions simultaneously and share discoveries.

Integration with Existing Patterns¶

With agent design patterns (Reflexion): heartbeat reflection is a formalized version of self-critique applied to the search process itself, not just individual outputs.

With multi agent systems (Shared State): the .shared/ knowledge layer is a concrete implementation of the shared-state communication protocol using the filesystem.

With context engineering: each agent maintains its own context focused on its current exploration direction. The shared knowledge layer acts as external memory, preventing context bloat from carrying all agents' history.

Gotchas¶

File locking on shared writes: multiple agents writing to .shared/ simultaneously can corrupt JSON files. Use atomic writes (write to temp file, then rename) or per-agent subdirectories with periodic merge
Note quality degrades without structure: agents generate vague notes ("tried X, didn't work") unless the heartbeat prompt explicitly requires structured observations with scores and hypotheses. Template the note format
Stagnation detection threshold matters: too sensitive (2-3 evals) causes premature pivots away from promising directions. Too loose (10+ evals) wastes compute. 5 consecutive non-improving evals is a reasonable default but should be tuned per task complexity
Shared skills can propagate bad patterns: if one agent writes a flawed skill to .shared/skills/, others will adopt it. Add a minimum score threshold before promoting observations to skills