
Agent Memory

Agent memory systems manage context across interactions - from short-term conversation buffers to persistent long-term knowledge stores. Since LLMs are stateless (each API call is independent), memory must be explicitly managed.

Key Facts

  • LLMs are stateless - the client must send full conversation history with each request
  • Token usage grows with conversation length - must be managed (truncation, summarization, windowing)
  • System prompt + retrieved context + history + new message must all fit in context window
  • Long-term memory persists across sessions via vector stores, databases, or file systems

Memory Types

Short-Term (Conversation Buffer)

Current dialogue history, cleared when session ends.

from langchain.memory import ConversationBufferMemory
# Stores all messages (grows unbounded)

from langchain.memory import ConversationBufferWindowMemory
# Keeps last K exchanges (sliding window)

from langchain.memory import ConversationSummaryMemory
# Summarizes old messages, keeps recent ones verbatim
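The windowing idea is simple enough to sketch without the library — a minimal buffer that keeps only the last K exchanges. The class name and structure here are illustrative, not a LangChain API:

```python
from collections import deque

class WindowBufferMemory:
    """Keeps only the last k user/assistant exchanges (2*k messages)."""

    def __init__(self, k: int = 3):
        self.messages = deque(maxlen=2 * k)  # each exchange = 2 messages

    def add_exchange(self, user_msg: str, assistant_msg: str) -> None:
        self.messages.append({"role": "user", "content": user_msg})
        self.messages.append({"role": "assistant", "content": assistant_msg})

    def load(self) -> list:
        return list(self.messages)

memory = WindowBufferMemory(k=2)
for i in range(5):
    memory.add_exchange(f"question {i}", f"answer {i}")

print(len(memory.load()))           # 4 messages: only the last 2 exchanges
print(memory.load()[0]["content"])  # "question 3"
```

Because `deque` enforces `maxlen`, old messages fall off automatically — the predictable-cost / lost-context tradeoff in one line.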

Long-Term Memory

Persists across sessions. Stored in vector databases, traditional databases, or file systems. Enables remembering user preferences, past interactions, learned facts.
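For the file-system option, a minimal sketch might serialize remembered facts to JSON between sessions. The filename and flat key/value schema are assumptions for illustration:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # illustrative location

def save_memory(facts: dict) -> None:
    """Persist long-term facts (user preferences, learned info) to disk."""
    MEMORY_FILE.write_text(json.dumps(facts))

def load_memory() -> dict:
    """Load facts from a previous session, or start empty."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {}

save_memory({"preferred_language": "Python"})
print(load_memory()["preferred_language"])  # "Python"
```

The same pattern scales up by swapping the JSON file for a database or vector store; the key point is that persistence is explicit — nothing survives the session otherwise.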

Episodic Memory

Specific interaction records: "Last time user asked about X, the answer was Y." Useful for personalization and avoiding repeated work.
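A toy episodic store can illustrate the idea — here retrieval is naive keyword overlap standing in for the embedding search a real system would use (all names are illustrative):

```python
class EpisodicMemory:
    """Stores past interactions; retrieval here is naive keyword
    overlap (a real system would use embedding similarity)."""

    def __init__(self):
        self.episodes = []  # list of (question, answer) records

    def record(self, question: str, answer: str) -> None:
        self.episodes.append((question, answer))

    def recall(self, query: str):
        """Return the past (question, answer) most similar to the query."""
        q_words = set(query.lower().split())
        best, best_score = None, 0
        for question, answer in self.episodes:
            score = len(q_words & set(question.lower().split()))
            if score > best_score:
                best, best_score = (question, answer), score
        return best

mem = EpisodicMemory()
mem.record("How do I reset my password?", "Use the account settings page.")
mem.record("What is the refund policy?", "Refunds within 30 days.")

print(mem.recall("password reset steps"))
```

Recalling a matching episode lets the agent reuse the earlier answer instead of re-deriving it — the "avoiding repeated work" benefit above.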

Semantic Memory

General accumulated knowledge. Domain-specific facts, company information. Usually implemented as a RAG knowledge base.

Working Memory (Scratchpad)

Agent's intermediate reasoning during task execution. Accumulated thoughts, actions, observations in the ReAct loop. Grows with each step and must be managed.
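One way to bound that growth is to render only the most recent steps into the next prompt. A minimal sketch, assuming a thought/action/observation tuple per step (names illustrative):

```python
class Scratchpad:
    """Working memory for a ReAct-style loop: thought/action/observation
    entries accumulate per step; only recent ones go into the prompt."""

    def __init__(self, max_steps_in_prompt: int = 5):
        self.steps = []
        self.max_steps_in_prompt = max_steps_in_prompt

    def add_step(self, thought: str, action: str, observation: str) -> None:
        self.steps.append((thought, action, observation))

    def render(self) -> str:
        # Cap prompt size by including only the most recent steps.
        recent = self.steps[-self.max_steps_in_prompt:]
        return "\n".join(
            f"Thought: {t}\nAction: {a}\nObservation: {o}"
            for t, a, o in recent
        )

pad = Scratchpad(max_steps_in_prompt=2)
for i in range(4):
    pad.add_step(f"think {i}", f"act {i}", f"see {i}")

print(pad.render())  # only steps 2 and 3 appear
```

The full step list is still retained in `self.steps` for logging or debugging; only the prompt view is windowed.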

Memory Management Strategies

| Strategy            | Mechanism                                      | Tradeoff                                        |
|---------------------|------------------------------------------------|-------------------------------------------------|
| Truncation          | Drop oldest messages                           | Simple, but loses early context                 |
| Summarization       | Periodically summarize older conversation      | Preserves key info, costs tokens to summarize   |
| Selective retention | Keep important messages, drop filler           | Best quality, hardest to implement              |
| External storage    | Write to DB, retrieve relevant parts on demand | Unlimited history, but adds retrieval latency   |
| Sliding window      | Keep last K exchanges                          | Predictable cost, loses older context           |
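The summarization strategy can be sketched as a function that collapses everything older than the last few messages into a single summary message. Here `summarize` is a stub standing in for an LLM call (an assumption for illustration):

```python
def compact_history(messages, keep_recent=4, summarize=None):
    """Replace all messages older than the last `keep_recent` with one
    summary message. `summarize` is a stand-in for an LLM call."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to summarize
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary_text = summarize(old) if summarize else (
        f"[summary of {len(old)} earlier messages]"
    )
    return system + [{"role": "system", "content": summary_text}] + recent

history = [{"role": "system", "content": "You are helpful."}]
for i in range(8):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

compacted = compact_history(history, keep_recent=4)
print(len(compacted))  # 6: system prompt + summary + 4 recent messages
```

This trades a one-time summarization cost for a bounded history — the tradeoff named in the table.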

Human-in-the-Loop (HITL)

Why HITL

  • Agent actions have real-world consequences (sending emails, modifying data)
  • LLMs make mistakes - human oversight catches errors
  • Compliance/regulatory requirements may mandate human approval
  • Builds trust during agent deployment

HITL Patterns

Approval gate: agent proposes action, waits for human approval:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    proposed_action: str
    approved: bool

def human_review_node(state):
    action = state["proposed_action"]
    approval = get_human_approval(action)  # UI, webhook, etc.
    return {"approved": approval}

graph = StateGraph(AgentState)
graph.add_node("propose_action", agent_propose)
graph.add_node("human_review", human_review_node)
graph.add_node("execute_action", agent_execute)

graph.add_edge(START, "propose_action")
graph.add_edge("propose_action", "human_review")
graph.add_conditional_edges(
    "human_review",
    lambda s: "execute" if s["approved"] else "revise",
    {"execute": "execute_action", "revise": "propose_action"}
)
graph.add_edge("execute_action", END)
app = graph.compile()

Escalation: agent handles simple cases autonomously, escalates complex/uncertain cases to human.
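A minimal sketch of escalation routing, assuming the agent can attach a confidence score to its answer — the 0.8 threshold and the queue are illustrative, not prescribed:

```python
HUMAN_QUEUE = []  # stand-in for a real review queue / ticketing system

def handle_request(request: str, confidence: float, threshold: float = 0.8):
    """Escalation pattern: answer autonomously when confident,
    otherwise hand off to a human."""
    if confidence >= threshold:
        return f"auto-answer: {request}"
    HUMAN_QUEUE.append(request)
    return "escalated to human review"

print(handle_request("reset password", 0.95))  # handled autonomously
print(handle_request("legal question", 0.40))  # escalated
print(HUMAN_QUEUE)  # ["legal question"]
```

In practice the confidence signal might come from model logprobs, a verifier model, or rule-based checks; the routing structure stays the same.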

Copilot pattern: automate the boring/repetitive parts, keep the human for judgment. The agent handles roughly 70-80% of the work; the human reviews in seconds. Productivity gains of around 3x are often reported, though results vary by task.

Feedback loop: human corrections after agent acts improve future performance.

Conversation History Management

from openai import OpenAI

client = OpenAI()

# Stateless API - must send full history each time
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up question"}
]
response = client.chat.completions.create(model="gpt-4", messages=messages)

Track token count and trim when approaching the limit. Keep system prompt and recent messages, summarize or drop older ones.
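A sketch of that trimming logic, using a rough 4-characters-per-token heuristic — a real implementation would count with a proper tokenizer such as tiktoken:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Use a real tokenizer
    # (e.g. tiktoken) for accurate counts.
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens=3000):
    """Keep the system prompt plus as many recent messages as fit."""
    system = [m for m in messages if m["role"] == "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest message, keeping what fits.
    for m in reversed([m for m in messages if m["role"] != "system"]):
        cost = estimate_tokens(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))

history = [{"role": "system", "content": "sys"}]
history += [{"role": "user", "content": "x" * 40} for _ in range(10)]

trimmed = trim_history(history, max_tokens=51)
print(len(trimmed))  # 6: system prompt + 5 most recent messages
```

Dropped messages could additionally be summarized or written to external storage rather than discarded, combining strategies from the table above.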

Gotchas

  • Conversation history grows unbounded without management - eventually exceeds context window
  • Summary memory loses nuance - critical details may be dropped
  • External memory (vector store) adds retrieval latency to every query
  • Human-in-the-loop adds latency but prevents costly mistakes
  • Never assume the agent "remembers" from a previous session without explicit long-term memory
  • Memory retrieval (via embeddings) has the same cosine similarity failures as RAG

See Also