Memory Transfer Learning for Coding Agents

Intermediate

Cross-domain memory transfer for coding agents. KAIST + NYU research showing that memory stored at the right abstraction level improves agent performance on unseen benchmarks by +3.7% on average, and by up to +8.3% on individual benchmarks.

Paper: arxiv 2604.14004
Project: memorytransfer.github.io
Code: github.com/KangsanKim07/MemoryTransferLearning (pending release)

The Abstraction Axis

Four memory types ordered from concrete to abstract:

| Type | Format | Transfer quality | Notes |
|---|---|---|---|
| Trajectory (MT) | `(task, [(action, observation), …])` | Negative-transfer risk | Full session log, including failed steps |
| Workflow (MW) | `(goal, [filtered_key_actions])` | Medium | Goal-relevant actions only, LLM-filtered |
| Summary (MS) | `(task_summary, experience_paragraph)` | Medium-high | Why the attempt succeeded or failed |
| Insight (MI) | `(title, description, content)` | Best | Task-agnostic principles |

Key finding: Insight beats all other formats across all models and benchmarks. Trajectory can cause negative transfer when domain-mismatched.
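The four formats above can be sketched as plain data structures. This is a hypothetical schema for illustration; the field names follow the table, not the paper's released code:

```python
# Illustrative schema for the four memory types, ordered concrete -> abstract.
# Field names mirror the formats in the table above; this is not the paper's code.
from dataclasses import dataclass


@dataclass
class TrajectoryMemory:      # most concrete: full session log
    task: str
    steps: list              # [(action, observation), ...] including failed steps


@dataclass
class WorkflowMemory:        # goal-relevant actions only, LLM-filtered
    goal: str
    key_actions: list


@dataclass
class SummaryMemory:         # why the attempt succeeded or failed
    task_summary: str
    experience: str


@dataclass
class InsightMemory:         # most abstract: a task-agnostic principle
    title: str
    description: str
    content: str
```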

Insight Generation Prompt

Generate insights on why this task was successfully accomplished
WITHOUT mentioning specific files or details.
Generate generalizable insights for future similar tasks.

Example insight that transferred (LiveCodeBench → SWE-Bench):

"Create quick self-contained tests using an inline Python here-doc to validate fixes"

This works because it is task-agnostic: it describes HOW to work, not WHAT the task was.

Failure example (Trajectory, negative transfer): R-language CLI commands applied to a C++ project. The trajectory described R-specific steps, which confused the agent in a different language context. The same lesson encoded as an insight ("inspect eval requirements, combine train+val, use robust preprocessing") transferred correctly.

Pipeline

Memory Generation

# For each completed task (success or failure):
# 1. Run agent, collect trajectory
# 2. LLM judge (gpt-5-mini) determines success/failure
# 3. Generate all 4 memory types from trajectory
# 4. Store with embedding (text-embedding-3-small)

# Different prompts for success vs. failure memories
# Failure insights: "what went wrong and how to avoid it"
# Success insights: "why this worked, generalize to other tasks"
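The four-step loop above can be sketched as an orchestration function. `run_agent`, `judge`, and `generate_memory` are stand-ins for the real agent run, the gpt-5-mini judge, and the per-type generation prompts; their signatures are assumptions, not the paper's API:

```python
# Hedged sketch of the memory-generation loop. The three callables are
# placeholders for the agent, the LLM judge, and the memory-writing prompts.
def build_memories(tasks, run_agent, judge, generate_memory):
    """For each task: run the agent, judge success, derive all four memory types."""
    pool = []
    for task in tasks:
        trajectory = run_agent(task)            # 1. collect the session log
        succeeded = judge(task, trajectory)     # 2. LLM judge verdict
        for mtype in ("trajectory", "workflow", "summary", "insight"):
            # 3. success vs. failure memories use different prompts upstream
            pool.append({
                "type": mtype,
                "outcome": "success" if succeeded else "failure",
                "body": generate_memory(mtype, task, trajectory, succeeded),
            })
    return pool                                  # 4. caller embeds and stores these
```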

Pool Construction

For benchmark B_i, the memory pool = all memories from all other benchmarks. Evaluation is strictly cross-domain — no self-leakage.

Embedding: text-embedding-3-small (OpenAI). Index: cosine similarity.
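A minimal sketch of the pool rule and the similarity index, assuming each stored memory carries a `benchmark` tag and a precomputed embedding (both field names are ours):

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_domain_pool(memories, target_benchmark):
    """Strictly cross-domain: exclude every memory from the benchmark under evaluation."""
    return [m for m in memories if m["benchmark"] != target_benchmark]
```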

Retrieval

Top-N = 3 memories retrieved per inference.

Query construction differs by type:
- Trajectory: embed the task description directly → cosine search
- Workflow / Summary / Insight: an LLM writes a 4-5 sentence solution plan → embed the plan → cosine search (necessary because short titles don't encode enough context)

Retrieved memories prepended to system prompt.
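The retrieval step can be sketched as follows. `embed` and `write_plan` stand in for text-embedding-3-small and the plan-writing LLM; the pool entries are assumed to carry a precomputed `embedding` field:

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_task, pool, embed, write_plan, memory_type="insight", n=3):
    """Return the top-N memories by cosine similarity to the query embedding."""
    if memory_type == "trajectory":
        query_vec = embed(query_task)              # embed the task description directly
    else:
        query_vec = embed(write_plan(query_task))  # embed a short solution plan instead
    ranked = sorted(pool, key=lambda m: cosine(query_vec, m["embedding"]), reverse=True)
    return ranked[:n]                              # these get prepended to the system prompt
```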

Results

Benchmarks: LiveCodeBench v6, Aider-Polyglot, SWEBench-Verified, TerminalBench2, ReplicationBench, MLGym-Bench
Models: GPT-5-mini, DeepSeek V3.2, Qwen3-Coder-480B

| Metric | Value |
|---|---|
| Average improvement | +3.7% |
| Best single benchmark | +8.3% |
| Insight vs. other formats | Consistently best |
| Task-agnostic vs. task-specific Insights | +1.1% |

Efficiency Comparison

| Method | Memory count | Performance vs. MTL |
|---|---|---|
| MTL (this paper) | 431 | baseline (best) |
| AgentKB | 5,899 | -1.7% |
| ReasoningBank | 97 | -2.9% |

431 high-abstraction insights beat 5,899 lower-abstraction memories. Quality > quantity.

Cross-Model Transfer

Insights generated by one model transfer to other models:
- GPT-5-mini → DeepSeek V3.2: works
- GPT-5-mini → Qwen3-Coder-480B: works
- Transfer is bidirectional: memories from weaker models also help stronger ones

Self-generated memories outperform cross-model, but cross-model still beats zero-shot.

What Transferred

Figure 4 (donut chart): the majority of the improvement comes from:
- Structured workflows
- Constraint guardrails
- Safe editing practices

The "how to work" meta-knowledge — not domain-specific facts.

Negative Transfer: Root Causes

  1. Domain-mismatched anchoring: structurally similar but contextually different memories mislead. A memory about R-language operations anchors the agent toward wrong tooling in a C++ context.

  2. False validation confidence: verification memories from a different context create belief that validation is complete when context doesn't match.

  3. Misapplied best-practice: successful pattern applied without checking if the preconditions hold.

Current limitation: LLM reranking and adaptive rewriting both underperform simple embedding retrieval. Static cosine search is brittle for heterogeneous agent settings, but no better retrieval method has been demonstrated yet.

Implications for Memory System Design

Abstraction Levels Map to Use Cases

| Memory type (this paper) | Equivalent in practice |
|---|---|
| Trajectory | Full session transcript (rarely useful as memory) |
| Workflow | Handoff document (useful for same-task continuity) |
| Summary | Post-session debrief (useful for similar tasks) |
| Insight | CLAUDE.md rules, principles files (cross-task transfer) |

Converting Low-Abstraction to Insight

Before (feedback/rule — task-specific):
"Don't mock the database in tests — had a migration incident"

After (insight — task-agnostic):
"When the code under test depends on an external system with schema drift
(DB, API, file format), integration tests against a real instance catch
what mocks miss. Apply when testing runtime behavior, not unit logic."

The insight version transfers to Slack API integration tests, GraphQL schema validation, S3 bucket policy tests — not just database migrations.
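This conversion can itself be delegated to an LLM. Below is a hypothetical prompt template mirroring the paper's "no specific files or details" constraint; the wording and function name are ours, not the paper's:

```python
# Hypothetical prompt template for lifting a task-specific rule into a
# task-agnostic insight. The wording is illustrative, not from the paper.
REWRITE_PROMPT = """\
Rewrite the following rule as a generalizable insight.
Do NOT mention specific files, functions, projects, or incidents.
State the underlying principle and the conditions under which it applies.

Rule: {rule}
Insight:"""


def to_insight_prompt(rule: str) -> str:
    """Build the rewrite prompt for a single task-specific rule."""
    return REWRITE_PROMPT.format(rule=rule)
```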

Pool vs. Context Loading

The paper's retrieval model (embed query → fetch top-3 → prepend to system prompt) mirrors JIT context loading. The key difference from loading all memories: selective retrieval avoids context overload and reduces noise from irrelevant memories.

For large memory systems (>100 files), embedding-based retrieval with top-3 selection outperforms full-context loading. See memory retrieval patterns for implementation.

Gotchas

  • Insights that name files or specific details don't transfer: the generation prompt explicitly prohibits mentioning file names, function names, or domain-specific details. Violating this produces Workflow-level memories that are mislabeled as Insights and will cause negative transfer.
  • Static retrieval fails at scale with heterogeneous tasks: the paper explicitly notes that simple cosine similarity degrades in heterogeneous agentic settings (multiple languages, multiple domains). This is a known open problem — no retrieval method clearly outperforms embedding similarity yet.
  • Memory pool needs cross-domain data to be useful: in-domain memories (ReasoningBank approach) improve less than cross-domain pools. An Insight pool built from only one type of task produces lower gains than a diverse pool, even at smaller total size.
  • Temporal validity not addressed: the paper has no mechanism for marking memories as stale. An Insight about a deprecated API or tooling pattern remains in the pool indefinitely. Add valid_until metadata and periodic lint passes.
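The staleness fix suggested in the last gotcha can be sketched as a lint pass. The `valid_until` field and the drop policy are this document's suggestion, not part of the paper:

```python
# Sketch of the suggested staleness lint. `valid_until` is an optional
# metadata field proposed above; memories without it are treated as fresh.
from datetime import date


def lint_pool(pool, today=None):
    """Split a memory pool into fresh entries and stale ones past valid_until."""
    today = today or date.today()
    fresh, stale = [], []
    for memory in pool:
        until = memory.get("valid_until")
        if until is not None and until < today:
            stale.append(memory)   # flag for review or regeneration
        else:
            fresh.append(memory)
    return fresh, stale
```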

See Also