Memory Transfer Learning for Coding Agents¶
Cross-domain memory transfer for coding agents. KAIST + NYU research showing that memory at the right abstraction level improves agent performance on unseen benchmarks by +3.7% on average, and by up to +8.3% on individual tasks.
Paper: arxiv 2604.14004
Project: memorytransfer.github.io
Code: github.com/KangsanKim07/MemoryTransferLearning (pending release)
The Abstraction Axis¶
Four memory types ordered from concrete to abstract:
| Type | Format | Transfer Quality | Notes |
|---|---|---|---|
| Trajectory (MT) | (task, [(action, observation), …]) | Negative transfer risk | Full session log including failed steps |
| Workflow (MW) | (goal, [filtered_key_actions]) | Medium | Goal-relevant actions only, LLM-filtered |
| Summary (MS) | (task_summary, experience_paragraph) | Medium-high | Why inference succeeded or failed |
| Insight (MI) | (title, description, content) | Best | Task-agnostic principles |
Key finding: Insight beats all other formats across all models and benchmarks. Trajectory can cause negative transfer when domain-mismatched.
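The four formats on the abstraction axis can be sketched as plain data containers. This is a minimal illustration of the shapes in the table above; the field names mirror the tuple formats listed there but are not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical containers for the four memory formats, ordered from
# most concrete (Trajectory) to most abstract (Insight).

@dataclass
class TrajectoryMemory:
    """Full session log, including failed steps."""
    task: str
    steps: list  # [(action, observation), ...]

@dataclass
class WorkflowMemory:
    """Goal-relevant actions only, LLM-filtered."""
    goal: str
    key_actions: list

@dataclass
class SummaryMemory:
    """Why the run succeeded or failed."""
    task_summary: str
    experience: str

@dataclass
class InsightMemory:
    """Task-agnostic principle; the best-transferring format."""
    title: str
    description: str
    content: str
```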
Insight Generation Prompt¶
Generate insights on why this task was successfully accomplished
WITHOUT mentioning specific files or details.
Generate generalizable insights for future similar tasks.
Example insight that transferred (LiveCodeBench → SWE-Bench):
"Create quick self-contained tests using an inline Python here-doc to validate fixes"
This works because it is task-agnostic: it describes HOW to work, not WHAT the task was.
Failure example (Trajectory, negative transfer): R-language CLI commands applied to a C++ project. The trajectory described R-specific steps, which confused the agent in a different language context. The same lesson encoded as an insight ("inspect eval requirements, combine train+val, use robust preprocessing") transferred correctly.
Pipeline¶
Memory Generation¶
# For each completed task (success or failure):
# 1. Run agent, collect trajectory
# 2. LLM judge (gpt-5-mini) determines success/failure
# 3. Generate all 4 memory types from trajectory
# 4. Store with embedding (text-embedding-3-small)
# Different prompts for success vs. failure memories
# Failure insights: "what went wrong and how to avoid it"
# Success insights: "why this worked, generalize to other tasks"
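The steps above can be sketched as a loop. The helpers `run_agent`, `judge`, `llm`, and `embed` are hypothetical stand-ins for the real agent harness, the gpt-5-mini judge, and the text-embedding-3-small call; the prompt wording is an assumption, not the paper's.

```python
# Sketch of the memory-generation pipeline, under the assumptions above.

def generate_memories(task, run_agent, judge, llm, embed):
    trajectory = run_agent(task)               # 1. collect trajectory
    succeeded = judge(task, trajectory)        # 2. LLM success/failure judge
    # 3. different framing for success vs. failure memories
    focus = ("why this worked; generalize to other tasks" if succeeded
             else "what went wrong and how to avoid it")
    memories = {}
    for mtype in ("trajectory", "workflow", "summary", "insight"):
        text = llm(f"Produce a {mtype} memory. Focus: {focus}", trajectory)
        memories[mtype] = {                    # 4. store with embedding
            "type": mtype,
            "text": text,
            "vector": embed(text),
            "success": succeeded,
        }
    return memories
```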
Pool Construction¶
For benchmark B_i, the memory pool = all memories from all other benchmarks. Evaluation is strictly cross-domain — no self-leakage.
Embedding: text-embedding-3-small (OpenAI). Index: cosine similarity.
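The leave-one-out pool construction is simple to state in code. A minimal sketch, assuming `memories_by_benchmark` is a dict mapping benchmark name to its memory records (a hypothetical structure, not the paper's storage format):

```python
# Cross-domain pool: memories for benchmark B_i come from every other
# benchmark, never from B_i itself (no self-leakage).

def build_pool(memories_by_benchmark, target_benchmark):
    pool = []
    for bench, mems in memories_by_benchmark.items():
        if bench != target_benchmark:  # strictly cross-domain
            pool.extend(mems)
    return pool
```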
Retrieval¶
Top-N = 3 memories retrieved per inference.
Query construction differs by type:
- Trajectory: embed the task description directly → cosine search
- Workflow / Summary / Insight: an LLM writes a 4-5 sentence solution plan → embed the plan → cosine search (necessary because short titles don't encode enough context)
Retrieved memories prepended to system prompt.
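The retrieval step can be sketched end to end. This is a minimal illustration, assuming `llm` and `embed` are hypothetical stand-ins for the plan-writing model and text-embedding-3-small; memory records carry `"text"` and `"vector"` keys as in the generation sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(task, pool, llm, embed, memory_type="insight", n=3):
    # Trajectory memories embed the task directly; the more abstract
    # types first ask the LLM for a short solution plan, then embed that.
    if memory_type == "trajectory":
        query_vec = embed(task)
    else:
        plan = llm(f"Write a 4-5 sentence solution plan for: {task}")
        query_vec = embed(plan)
    ranked = sorted(pool, key=lambda m: cosine(query_vec, m["vector"]),
                    reverse=True)
    return ranked[:n]  # top-N = 3 per inference

def build_system_prompt(base_prompt, memories):
    # Retrieved memories are prepended to the system prompt.
    block = "\n".join(m["text"] for m in memories)
    return f"Relevant past experience:\n{block}\n\n{base_prompt}"
```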
Results¶
Benchmarks: LiveCodeBench v6, Aider-Polyglot, SWEBench-Verified, TerminalBench2, ReplicationBench, MLGym-Bench
Models: GPT-5-mini, DeepSeek V3.2, Qwen3-Coder-480B
| Metric | Value |
|---|---|
| Average improvement | +3.7% |
| Best single benchmark | +8.3% |
| Insight vs. others | Consistently best |
| Task-agnostic Insight vs. task-specific | +1.1% |
Efficiency Comparison¶
| Method | Memory count | Performance vs. MTL |
|---|---|---|
| MTL (this paper) | 431 | baseline (best) |
| AgentKB | 5,899 | -1.7% |
| ReasoningBank | 97 | -2.9% |
431 high-abstraction insights beat 5,899 lower-abstraction memories. Quality > quantity.
Cross-Model Transfer¶
Insights generated by one model transfer to other models:
- GPT-5-mini → DeepSeek V3.2: works
- GPT-5-mini → Qwen3-Coder-480B: works
- Bidirectional: weaker → stronger models also transfer
Self-generated memories outperform cross-model, but cross-model still beats zero-shot.
What Transferred¶
Figure 4 (donut): the majority of the improvement comes from:
- Structured workflows
- Constraint guardrails
- Safe editing practices
The "how to work" meta-knowledge — not domain-specific facts.
Negative Transfer: Root Causes¶
1. Domain-mismatched anchoring: structurally similar but contextually different memories mislead. A memory about R-language operations anchors the agent toward the wrong tooling in a C++ context.
2. False validation confidence: verification memories from a different context create the belief that validation is complete when the context doesn't match.
3. Misapplied best practice: a successful pattern applied without checking whether its preconditions hold.
Current limitation: LLM reranking and adaptive rewriting both underperform simple embedding retrieval. Static cosine search is brittle for heterogeneous agent settings, but no better retrieval method has been demonstrated yet.
Implications for Memory System Design¶
Abstraction Levels Map to Use Cases¶
| Memory Type (this paper) | Equivalent in practice |
|---|---|
| Trajectory | Full session transcript (rarely useful as memory) |
| Workflow | Handoff document (useful for same-task continuity) |
| Summary | Post-session debrief (useful for similar tasks) |
| Insight | CLAUDE.md rules, principles files (cross-task transfer) |
Converting Low-Abstraction to Insight¶
Before (feedback/rule — task-specific):
"Don't mock the database in tests — had a migration incident"
After (insight — task-agnostic):
"When the code under test depends on an external system with schema drift
(DB, API, file format), integration tests against a real instance catch
what mocks miss. Apply when testing runtime behavior, not unit logic."
The insight version transfers to Slack API integration tests, GraphQL schema validation, S3 bucket policy tests — not just database migrations.
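The before/after conversion above can be automated with a rewriting prompt. A minimal sketch; the prompt wording is my assumption, modeled on the paper's insight-generation constraints (no file names, no domain specifics), and `llm` is a hypothetical completion function.

```python
# Hypothetical prompt for lifting a task-specific rule to an Insight.
REWRITE_PROMPT = (
    "Rewrite the following rule as a generalizable insight. "
    "Do NOT mention specific files, functions, or domain details. "
    "State WHEN the insight applies and WHY it works.\n\n"
    "Rule: {rule}"
)

def to_insight(rule, llm):
    """Convert a low-abstraction rule into a task-agnostic insight."""
    return llm(REWRITE_PROMPT.format(rule=rule))
```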
Pool vs. Context Loading¶
The paper's retrieval model (embed query → fetch top-3 → prepend to system prompt) mirrors JIT context loading. The key difference from loading all memories: selective retrieval avoids context overload and reduces noise from irrelevant memories.
For large memory systems (>100 files), embedding-based retrieval with top-3 selection outperforms full-context loading. See memory retrieval patterns for implementation.
Gotchas¶
- Insights that name files or specific details don't transfer: the generation prompt explicitly prohibits mentioning file names, function names, or domain-specific details. Violating this produces Workflow-level memories that are mislabeled as Insights and will cause negative transfer.
- Static retrieval fails at scale with heterogeneous tasks: the paper explicitly notes that simple cosine similarity degrades in heterogeneous agentic settings (multiple languages, multiple domains). This is a known open problem — no retrieval method clearly outperforms embedding similarity yet.
- Memory pool needs cross-domain data to be useful: in-domain memories (ReasoningBank approach) improve less than cross-domain pools. An Insight pool built from only one type of task produces lower gains than a diverse pool, even at smaller total size.
- Temporal validity not addressed: the paper has no mechanism for marking memories as stale. An Insight about a deprecated API or tooling pattern remains in the pool indefinitely. Add `valid_until` metadata and periodic lint passes.
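The suggested `valid_until` lint pass (my proposed remedy, not something the paper implements) can be sketched as a periodic filter over the pool:

```python
from datetime import date

def lint_pool(pool, today=None):
    """Split a memory pool into fresh and stale entries by `valid_until`.

    Entries without a `valid_until` key are treated as always fresh.
    """
    today = today or date.today()
    fresh, stale = [], []
    for mem in pool:
        until = mem.get("valid_until")  # ISO date string, e.g. "2026-01-01"
        if until and date.fromisoformat(until) < today:
            stale.append(mem)           # flag for review or removal
        else:
            fresh.append(mem)
    return fresh, stale
```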
See Also¶
- memory architectures - memory system design patterns
- memory retrieval patterns - retrieval strategies, embedding vs. graph
- agent memory - short vs. long-term memory types
- verbatim retrieval vs extraction - when to extract vs. store verbatim
- temporal memory - time-decay and validity for memory entries