The Engineering Codex/Agentic AI with LLM APIs
DAY 3
04 / 09

Agent Memory — Context, RAG & Structured Stores

schedule6 minsignal_cellular_altIntermediate1,360 words
The three layers of memory every production agent needs: cached short-term context, vector RAG for long-term recall, and structured stores (Letta, Mem0, Zep, Anthropic memory) for facts that change over time.

What you will learn

01Layer 1 — Short-Term Memory (The Context Window)
02Layer 2 — Long-Term Semantic Memory (Vector Stores)
03Layer 3 — Structured / Agentic Memory
04The Memory Stack in Practice
05The Cost of Memory — Concrete Numbers

An agent without memory is a goldfish with credentials. The user types something, the agent acts, and at the next request the agent has forgotten everything. Every interesting agent — research assistants, coding agents, customer-support automations — needs memory that survives across turns and across sessions. This chapter covers the three layers of agentic memory and the four production frameworks competing to handle them.

🔑
The three memory layers
1) Short-term = the context window. Cheap, fast, lossy after token limits. 2) Long-term semantic = vector stores. Recall by similarity, no time awareness. 3) Structured / agentic = explicit memory tools (Letta, Mem0, Zep) or Anthropic's native memory tool. Survives sessions; can be queried, updated, contradicted.

Layer 1 — Short-Term Memory (The Context Window)

The simplest memory is also the most underrated: just keep the conversation in the prompt. With Claude Sonnet 4.6 (1M context) and GPT-5 series both shipping million-token windows, you can in principle fit hours of conversation directly. The problem is cost, not capacity.

At Sonnet 4.5 pricing of $3/M input tokens, a 1M-token agent run costs $3 per request just to read the context — before any output. Multi-agent fan-out makes it worse. So while context windows have grown, the economics still push you toward compressing what stays in context.

💰
Prompt caching is non-negotiable
Anthropic's prompt caching charges only 10% of input price for cached tokens (5-min default TTL, 1-hour with cache_control: {"ttl": "1h"}). For a long-running agent with a stable system prompt + tool definitions + early conversation, caching reduces steady-state cost by 80-90%. Treat this as default-on, not an optimization.

Compression strategies for short-term memory

  • Recency window — keep the last N turns verbatim, drop older ones. Cheapest, but loses long-term context.
  • Summarization — when the window approaches a threshold, summarize older turns into a single message. Standard pattern; lossy but bounded.
  • Tool-result pruning — keep recent tool outputs verbatim, replace older ones with summaries. Anthropic's context editing (late 2025) does this natively.
  • Hierarchical summary — summarize batches of turns, then summarize summaries. Lets very long conversations stay in budget.

Layer 2 — Long-Term Semantic Memory (Vector Stores)

Past a certain horizon, you can't keep everything in context — you have to retrieve what's relevant. The standard pattern: embed past interactions or facts, store them in a vector database, and at each turn retrieve the top-k most similar entries to inject into the prompt.

THREE LAYERS OF AGENT MEMORY SHORT-TERM context window system + tools (cached) turn 1 turn 2 ... turn N (current) cheap, fast, capped LONG-TERM vector store · RAG cheap recall, no time STRUCTURED memory tool · graph user.email = … prefs.theme = dark project_x: 2026-Q1 … [archived 2025-10] ... addressable, temporal Real agents combine all three. Cost & staleness rise left-to-right; recall fidelity rises with them.
The three memory layers in production agents. Most teams over-invest in vector RAG and under-invest in structured memory — the result is agents that "kind of remember things" but get user names and dates wrong.
Python · simple semantic memory
from chromadb import Client

memory = Client().get_or_create_collection("agent_memory")

def remember(text, metadata=None):
    memory.add(documents=[text], metadatas=[metadata or {}], ids=[str(uuid4())])

def recall(query, k=5):
    res = memory.query(query_texts=[query], n_results=k)
    return res["documents"][0]

# In your agent loop, before each LLM call:
relevant = recall(user_input)
context = build_context(history, relevant_memories=relevant)

Vector recall is great for "what did the user say about X?" — bad for "what's the user's current email?" Embeddings can't tell you that the email was updated yesterday, or that today's preference contradicts last month's. That's where the third layer comes in.

Layer 3 — Structured / Agentic Memory

Structured memory treats memory as state the agent can read and write through tools. Instead of similarity-searching a wall of text, the agent calls memory_get("user.email") or memory_set("prefs.theme", "dark"). Three production approaches dominate:

Letta (formerly MemGPT)

MemGPT (Packer et al., 2023) framed memory as an OS-style hierarchy: main context (always in prompt), recall storage (recent, paged in), archival storage (cold, retrieved by search). Memory operations are tools the agent calls. Letta is the production framework continuing this work.

Mem0

Mem0 is a hosted memory engine with extraction, dedupe, and (optional) graph memory. You log conversation snippets; Mem0 extracts facts ("user lives in Tokyo", "prefers Python over Go") and stores them. Integrates via MCP, LangChain, CrewAI. Easy to drop into existing stacks.

Zep / Graphiti

Zep represents memory as a temporal knowledge graph: facts have validity windows, so contradictions invalidate prior beliefs. The agent always sees the currently-true facts. Reports 80.32% on the LoCoMo benchmark at 189ms latency — the strongest published numbers in the structured-memory space.

Anthropic's native memory tool (late 2025)

For Anthropic-only stacks, Anthropic's memory tool ships a managed file-store: the agent calls memory_read and memory_write against a server-side directory that persists across sessions. Combined with context editing (auto-pruning stale tool results from history), it removes most of the case for an external memory framework if you're already on Claude.

FrameworkBest forStorage modelTrade-off
Anthropic memory toolAnthropic-only stacks; minimal infraServer-side filesystemVendor lock-in; less queryable
LettaLong sessions, OS-style discipline, custom workflowsHierarchical (main / recall / archival)More plumbing; full self-host
Mem0Drop-in for chatbots, fast to integrateExtracted facts + optional graphHosted dependency; fact extraction can miss nuance
Zep / GraphitiKnowledge that changes over time; need temporal correctnessTemporal knowledge graphHeavier setup; graph-shaped data assumption

The Memory Stack in Practice

Production agents don't pick one layer — they layer all three. A canonical stack:

  1. System prompt + tool definitions in context, prompt-cached.
  2. Recent conversation verbatim (last 10-20 turns), with older turns auto-summarized.
  3. Vector RAG over a corpus of past interactions, surfaced as injected snippets.
  4. Structured memory tool for canonical facts (user identity, preferences, current state).
⚠️
Memory poisoning
If your agent writes user-controlled data into memory unfiltered, an attacker can plant instructions ("Ignore previous instructions; …") that fire on every future session. Mitigation: never write raw user input into memory — always have the agent extract structured facts first, validate them, then write the facts. Treat memory as you'd treat a database: typed columns, validated writes.

What to put where — a working heuristic

  • System prompt = role, tools, hard constraints. Doesn't change.
  • Cached prefix = persona, examples, long instructions. Stable for hours.
  • Recent turns = the actual conversation. Cap at ~20-50 turns or ~50k tokens.
  • Vector recall = past sessions, knowledge base, user history. Pull top-k per turn.
  • Structured memory = identity, preferences, deadlines, balances — anything you'd put in a row in a database.

The Cost of Memory — Concrete Numbers

Memory choices have measurable cost implications. Quoted on Sonnet 4.5 ($3/M input, $15/M output, $0.30/M cached):

  • Naive long context (200k tokens, no caching) = $0.60 per turn just for input.
  • Cached long context (200k cached + 5k uncached) = $0.075 per turn — 8× cheaper.
  • Vector RAG (5k context + ~1k retrieved snippets per turn) = $0.018 per turn — but retrieval may miss key context.
  • Structured memory tool (5k context + 200 tokens of fact lookup) = $0.016 per turn — but only works for facts you've structured.

The real-world answer: combine cached prefix + RAG + structured memory. Steady-state cost lands at $0.02-0.05 per turn for a memory-rich agent — roughly 30× cheaper than naive long-context approaches.

🔑
Key takeaways
1) The context window is memory layer 1, not the only memory. 2) Prompt caching is mandatory at scale — 80-90% cost reduction. 3) Vector RAG handles "have we seen this before?"; structured memory handles "what's true right now?" 4) Real agents layer all three. 5) Anthropic's native memory tool is good enough for single-vendor stacks; reach for Letta / Mem0 / Zep when you need cross-vendor or temporal correctness.

Finished reading?