
Agent Memory — Context, RAG & Structured Stores
The three layers of memory every production agent needs: cached short-term context, vector RAG for long-term recall, and structured stores (Letta, Mem0, Zep, Anthropic memory) for facts that change over time.
What you will learn
An agent without memory is a goldfish with credentials. The user types something, the agent acts, and at the next request the agent has forgotten everything. Every interesting agent — research assistants, coding agents, customer-support automations — needs memory that survives across turns and across sessions. This chapter covers the three layers of agentic memory and the four production frameworks competing to handle them.
Layer 1 — Short-Term Memory (The Context Window)
The simplest memory is also the most underrated: just keep the conversation in the prompt. With Claude Sonnet 4.6 (1M context) and GPT-5 series both shipping million-token windows, you can in principle fit hours of conversation directly. The problem is cost, not capacity.
At Sonnet 4.5 pricing of $3/M input tokens, a 1M-token agent run costs $3 per request just to read the context — before any output. Multi-agent fan-out makes it worse. So while context windows have grown, the economics still push you toward compressing what stays in context.
cache_control: {"ttl": "1h"}). For a long-running agent with a stable system prompt + tool definitions + early conversation, caching reduces steady-state cost by 80-90%. Treat this as default-on, not an optimization.Compression strategies for short-term memory
- Recency window — keep the last N turns verbatim, drop older ones. Cheapest, but loses long-term context.
- Summarization — when the window approaches a threshold, summarize older turns into a single message. Standard pattern; lossy but bounded.
- Tool-result pruning — keep recent tool outputs verbatim, replace older ones with summaries. Anthropic's context editing (late 2025) does this natively.
- Hierarchical summary — summarize batches of turns, then summarize summaries. Lets very long conversations stay in budget.
Layer 2 — Long-Term Semantic Memory (Vector Stores)
Past a certain horizon, you can't keep everything in context — you have to retrieve what's relevant. The standard pattern: embed past interactions or facts, store them in a vector database, and at each turn retrieve the top-k most similar entries to inject into the prompt.
from chromadb import Client memory = Client().get_or_create_collection("agent_memory") def remember(text, metadata=None): memory.add(documents=[text], metadatas=[metadata or {}], ids=[str(uuid4())]) def recall(query, k=5): res = memory.query(query_texts=[query], n_results=k) return res["documents"][0] # In your agent loop, before each LLM call: relevant = recall(user_input) context = build_context(history, relevant_memories=relevant)
Vector recall is great for "what did the user say about X?" — bad for "what's the user's current email?" Embeddings can't tell you that the email was updated yesterday, or that today's preference contradicts last month's. That's where the third layer comes in.
Layer 3 — Structured / Agentic Memory
Structured memory treats memory as state the agent can read and write through tools. Instead of similarity-searching a wall of text, the agent calls memory_get("user.email") or memory_set("prefs.theme", "dark"). Three production approaches dominate:
Letta (formerly MemGPT)
MemGPT (Packer et al., 2023) framed memory as an OS-style hierarchy: main context (always in prompt), recall storage (recent, paged in), archival storage (cold, retrieved by search). Memory operations are tools the agent calls. Letta is the production framework continuing this work.
Mem0
Mem0 is a hosted memory engine with extraction, dedupe, and (optional) graph memory. You log conversation snippets; Mem0 extracts facts ("user lives in Tokyo", "prefers Python over Go") and stores them. Integrates via MCP, LangChain, CrewAI. Easy to drop into existing stacks.
Zep / Graphiti
Zep represents memory as a temporal knowledge graph: facts have validity windows, so contradictions invalidate prior beliefs. The agent always sees the currently-true facts. Reports 80.32% on the LoCoMo benchmark at 189ms latency — the strongest published numbers in the structured-memory space.
Anthropic's native memory tool (late 2025)
For Anthropic-only stacks, Anthropic's memory tool ships a managed file-store: the agent calls memory_read and memory_write against a server-side directory that persists across sessions. Combined with context editing (auto-pruning stale tool results from history), it removes most of the case for an external memory framework if you're already on Claude.
| Framework | Best for | Storage model | Trade-off |
|---|---|---|---|
| Anthropic memory tool | Anthropic-only stacks; minimal infra | Server-side filesystem | Vendor lock-in; less queryable |
| Letta | Long sessions, OS-style discipline, custom workflows | Hierarchical (main / recall / archival) | More plumbing; full self-host |
| Mem0 | Drop-in for chatbots, fast to integrate | Extracted facts + optional graph | Hosted dependency; fact extraction can miss nuance |
| Zep / Graphiti | Knowledge that changes over time; need temporal correctness | Temporal knowledge graph | Heavier setup; graph-shaped data assumption |
The Memory Stack in Practice
Production agents don't pick one layer — they layer all three. A canonical stack:
- System prompt + tool definitions in context, prompt-cached.
- Recent conversation verbatim (last 10-20 turns), with older turns auto-summarized.
- Vector RAG over a corpus of past interactions, surfaced as injected snippets.
- Structured memory tool for canonical facts (user identity, preferences, current state).
What to put where — a working heuristic
- System prompt = role, tools, hard constraints. Doesn't change.
- Cached prefix = persona, examples, long instructions. Stable for hours.
- Recent turns = the actual conversation. Cap at ~20-50 turns or ~50k tokens.
- Vector recall = past sessions, knowledge base, user history. Pull top-k per turn.
- Structured memory = identity, preferences, deadlines, balances — anything you'd put in a row in a database.
The Cost of Memory — Concrete Numbers
Memory choices have measurable cost implications. Quoted on Sonnet 4.5 ($3/M input, $15/M output, $0.30/M cached):
- Naive long context (200k tokens, no caching) = $0.60 per turn just for input.
- Cached long context (200k cached + 5k uncached) = $0.075 per turn — 8× cheaper.
- Vector RAG (5k context + ~1k retrieved snippets per turn) = $0.018 per turn — but retrieval may miss key context.
- Structured memory tool (5k context + 200 tokens of fact lookup) = $0.016 per turn — but only works for facts you've structured.
The real-world answer: combine cached prefix + RAG + structured memory. Steady-state cost lands at $0.02-0.05 per turn for a memory-rich agent — roughly 30× cheaper than naive long-context approaches.
- MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023)arxiv.org
- Anthropic — Context Management with the Memory Toolclaude.com
- Anthropic — Prompt Cachingdocs.claude.com
- Zep — Temporal Knowledge Graphs for AI Agentsgetzep.com
- Mem0 — Memory Engine for AI Agentsdocs.mem0.ai
Finished reading?