
Agent Memory

Agent memory systems manage context across interactions - from short-term conversation buffers to persistent long-term knowledge stores. Since LLMs are stateless (each API call is independent), memory must be explicitly managed.

Key Facts

  • LLMs are stateless - the client must send full conversation history with each request
  • Token usage grows with conversation length - must be managed (truncation, summarization, windowing)
  • System prompt + retrieved context + history + new message must all fit in context window
  • Long-term memory persists across sessions via vector stores, databases, or file systems

Memory Types

Short-Term (Conversation Buffer)

Current dialogue history, cleared when session ends.

from langchain.memory import ConversationBufferMemory
# Stores all messages (grows unbounded)

from langchain.memory import ConversationBufferWindowMemory
# Keeps last K exchanges (sliding window)

from langchain.memory import ConversationSummaryMemory
# Summarizes old messages, keeps recent ones verbatim
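The windowing idea is simple enough to sketch without the library — a minimal buffer that keeps only the last K exchanges. The class name and structure here are illustrative, not a LangChain API:

```python
from collections import deque

class WindowBufferMemory:
    """Keeps only the last k user/assistant exchanges (2*k messages)."""

    def __init__(self, k: int = 3):
        self.messages = deque(maxlen=2 * k)  # each exchange = 2 messages

    def add_exchange(self, user_msg: str, assistant_msg: str) -> None:
        self.messages.append({"role": "user", "content": user_msg})
        self.messages.append({"role": "assistant", "content": assistant_msg})

    def load(self) -> list:
        return list(self.messages)

memory = WindowBufferMemory(k=2)
for i in range(5):
    memory.add_exchange(f"question {i}", f"answer {i}")

print(len(memory.load()))           # 4 messages: only the last 2 exchanges
print(memory.load()[0]["content"])  # "question 3"
```

Because `deque` enforces `maxlen`, old messages fall off automatically — the predictable-cost / lost-context tradeoff in one line.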

Long-Term Memory

Persists across sessions. Stored in vector databases, traditional databases, or file systems. Enables remembering user preferences, past interactions, learned facts.
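For the file-system option, a minimal sketch might serialize remembered facts to JSON between sessions. The filename and flat key/value schema are assumptions for illustration:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # illustrative location

def save_memory(facts: dict) -> None:
    """Persist long-term facts (user preferences, learned info) to disk."""
    MEMORY_FILE.write_text(json.dumps(facts))

def load_memory() -> dict:
    """Load facts from a previous session, or start empty."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {}

save_memory({"preferred_language": "Python"})
print(load_memory()["preferred_language"])  # "Python"
```

The same pattern scales up by swapping the JSON file for a database or vector store; the key point is that persistence is explicit — nothing survives the session otherwise.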

Episodic Memory

Specific interaction records: "Last time user asked about X, the answer was Y." Useful for personalization and avoiding repeated work.
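A toy episodic store can illustrate the idea — here retrieval is naive keyword overlap standing in for the embedding search a real system would use (all names are illustrative):

```python
class EpisodicMemory:
    """Stores past interactions; retrieval here is naive keyword
    overlap (a real system would use embedding similarity)."""

    def __init__(self):
        self.episodes = []  # list of (question, answer) records

    def record(self, question: str, answer: str) -> None:
        self.episodes.append((question, answer))

    def recall(self, query: str):
        """Return the past (question, answer) most similar to the query."""
        q_words = set(query.lower().split())
        best, best_score = None, 0
        for question, answer in self.episodes:
            score = len(q_words & set(question.lower().split()))
            if score > best_score:
                best, best_score = (question, answer), score
        return best

mem = EpisodicMemory()
mem.record("How do I reset my password?", "Use the account settings page.")
mem.record("What is the refund policy?", "Refunds within 30 days.")

print(mem.recall("password reset steps"))
```

Recalling a matching episode lets the agent reuse the earlier answer instead of re-deriving it — the "avoiding repeated work" benefit above.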

Semantic Memory

General accumulated knowledge. Domain-specific facts, company information. Usually implemented as a RAG knowledge base.

Working Memory (Scratchpad)

Agent's intermediate reasoning during task execution. Accumulated thoughts, actions, observations in the ReAct loop. Grows with each step and must be managed.
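One way to bound that growth is to render only the most recent steps into the next prompt. A minimal sketch, assuming a thought/action/observation tuple per step (names illustrative):

```python
class Scratchpad:
    """Working memory for a ReAct-style loop: thought/action/observation
    entries accumulate per step; only recent ones go into the prompt."""

    def __init__(self, max_steps_in_prompt: int = 5):
        self.steps = []
        self.max_steps_in_prompt = max_steps_in_prompt

    def add_step(self, thought: str, action: str, observation: str) -> None:
        self.steps.append((thought, action, observation))

    def render(self) -> str:
        # Cap prompt size by including only the most recent steps.
        recent = self.steps[-self.max_steps_in_prompt:]
        return "\n".join(
            f"Thought: {t}\nAction: {a}\nObservation: {o}"
            for t, a, o in recent
        )

pad = Scratchpad(max_steps_in_prompt=2)
for i in range(4):
    pad.add_step(f"think {i}", f"act {i}", f"see {i}")

print(pad.render())  # only steps 2 and 3 appear
```

The full step list is still retained in `self.steps` for logging or debugging; only the prompt view is windowed.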

Memory Management Strategies

| Strategy            | Mechanism                                      | Tradeoff                                        |
|---------------------|------------------------------------------------|-------------------------------------------------|
| Truncation          | Drop oldest messages                           | Simple, but loses early context                 |
| Summarization       | Periodically summarize older conversation      | Preserves key info, costs tokens to summarize   |
| Selective retention | Keep important messages, drop filler           | Best quality, hardest to implement              |
| External storage    | Write to DB, retrieve relevant parts on demand | Unlimited history, but adds retrieval latency   |
| Sliding window      | Keep last K exchanges                          | Predictable cost, loses older context           |
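The summarization strategy can be sketched as a function that collapses everything older than the last few messages into a single summary message. Here `summarize` is a stub standing in for an LLM call (an assumption for illustration):

```python
def compact_history(messages, keep_recent=4, summarize=None):
    """Replace all messages older than the last `keep_recent` with one
    summary message. `summarize` is a stand-in for an LLM call."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to summarize
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary_text = summarize(old) if summarize else (
        f"[summary of {len(old)} earlier messages]"
    )
    return system + [{"role": "system", "content": summary_text}] + recent

history = [{"role": "system", "content": "You are helpful."}]
for i in range(8):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

compacted = compact_history(history, keep_recent=4)
print(len(compacted))  # 6: system prompt + summary + 4 recent messages
```

This trades a one-time summarization cost for a bounded history — the tradeoff named in the table.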

Human-in-the-Loop (HITL)

Why HITL

  • Agent actions have real-world consequences (sending emails, modifying data)
  • LLMs make mistakes - human oversight catches errors
  • Compliance/regulatory requirements may mandate human approval
  • Builds trust during agent deployment

HITL Patterns

Approval gate: agent proposes action, waits for human approval:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    proposed_action: str
    approved: bool

def human_review_node(state):
    action = state["proposed_action"]
    approval = get_human_approval(action)  # UI, webhook, etc.
    return {"approved": approval}

graph = StateGraph(AgentState)
graph.add_node("propose_action", agent_propose)
graph.add_node("human_review", human_review_node)
graph.add_node("execute_action", agent_execute)

graph.add_edge(START, "propose_action")
graph.add_edge("propose_action", "human_review")
graph.add_conditional_edges(
    "human_review",
    lambda s: "execute" if s["approved"] else "revise",
    {"execute": "execute_action", "revise": "propose_action"}
)
graph.add_edge("execute_action", END)
app = graph.compile()

Escalation: agent handles simple cases autonomously, escalates complex/uncertain cases to human.
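A minimal sketch of escalation routing, assuming the agent can attach a confidence score to its answer — the 0.8 threshold and the queue are illustrative, not prescribed:

```python
HUMAN_QUEUE = []  # stand-in for a real review queue / ticketing system

def handle_request(request: str, confidence: float, threshold: float = 0.8):
    """Escalation pattern: answer autonomously when confident,
    otherwise hand off to a human."""
    if confidence >= threshold:
        return f"auto-answer: {request}"
    HUMAN_QUEUE.append(request)
    return "escalated to human review"

print(handle_request("reset password", 0.95))  # handled autonomously
print(handle_request("legal question", 0.40))  # escalated
print(HUMAN_QUEUE)  # ["legal question"]
```

In practice the confidence signal might come from model logprobs, a verifier model, or rule-based checks; the routing structure stays the same.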

Copilot pattern: automate the boring/repetitive parts, keep the human for judgment. The agent handles roughly 70-80% of the work; the human reviews in seconds. Productivity gains of around 3x are often reported, though results vary by task.

Feedback loop: human corrections after agent acts improve future performance.

Conversation History Management

from openai import OpenAI

client = OpenAI()

# Stateless API - must send full history each time
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up question"}
]
response = client.chat.completions.create(model="gpt-4", messages=messages)

Track token count and trim when approaching the limit. Keep system prompt and recent messages, summarize or drop older ones.
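A sketch of that trimming logic, using a rough 4-characters-per-token heuristic — a real implementation would count with a proper tokenizer such as tiktoken:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Use a real tokenizer
    # (e.g. tiktoken) for accurate counts.
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens=3000):
    """Keep the system prompt plus as many recent messages as fit."""
    system = [m for m in messages if m["role"] == "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest message, keeping what fits.
    for m in reversed([m for m in messages if m["role"] != "system"]):
        cost = estimate_tokens(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))

history = [{"role": "system", "content": "sys"}]
history += [{"role": "user", "content": "x" * 40} for _ in range(10)]

trimmed = trim_history(history, max_tokens=51)
print(len(trimmed))  # 6: system prompt + 5 most recent messages
```

Dropped messages could additionally be summarized or written to external storage rather than discarded, combining strategies from the table above.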

Gotchas

  • Conversation history grows unbounded without management - eventually exceeds context window
  • Summary memory loses nuance - critical details may be dropped
  • External memory (vector store) adds retrieval latency to every query
  • Human-in-the-loop adds latency but prevents costly mistakes
  • Never assume the agent "remembers" from a previous session without explicit long-term memory
  • Memory retrieval (via embeddings) has the same cosine similarity failures as RAG

See Also