Memory Systems ; Short, Long, and Episodic
How agents remember things: context window tricks, external key-value stores, vector databases, and the architectural decisions that determine what your agent knows.
A model has no memory of its own. Between API calls, no state is retained. Every appearance of memory in an AI system is an engineering construct: text retrieved from somewhere and stuffed into the next prompt. That fact is liberating because it means memory is fully under your control, and it is dangerous because it means the boundary between "what the agent knows" and "what an attacker can put in front of the agent" is a design decision you have to get right.
There are three memory layers in any production agent. They differ in scope (how long they last), in fidelity (how much they preserve), and in attack surface (who can write to them). Get the layers right and the agent feels intelligent. Get them wrong and you have built a sieve.
Layer 1: In-context working memory
The simplest memory is the message list itself. Every previous turn is appended to the context, the model reads all of it, and that is the working memory. This is what your agent loop already does.
Pros: zero infrastructure, full fidelity, the model sees everything.
Cons: bounded by the context window, lost when the session ends, costs token money on every iteration.
Working memory is the right answer for short tasks. A coding agent solving a single ticket can keep its entire trajectory in context. A reconnaissance agent gathering data for one report can keep its tool results in context. As soon as the task spans more than maybe twenty model calls or the tool results get large, you have to start managing the context budget actively.
Three management strategies you will use:
Truncation. Drop the oldest messages when the context approaches the limit. Cheap, lossy. Useful when only recency matters.
Summarisation. When the context gets large, send the older portion to a cheap model with the prompt "summarise this conversation so far, preserving any concrete facts and decisions." Replace the old turns with the summary. The agent loses verbatim fidelity but keeps the gist.
Selective compression. Mark some turns as essential (the original goal, recent tool results) and compress everything else. This is what coding agents like Claude Code do when sessions get long. It works well but requires you to decide what counts as essential up front.
The summarisation strategy introduces a new threat. The summariser sees user content and tool results. If either contains injection, the summariser can be tricked into producing a summary that misrepresents what happened, and the agent then trusts that summary as ground truth. Either run the summariser with a hardened system prompt and treat its output as untrusted, or accept that summarisation can be poisoned and validate downstream.
Layer 2: External structured memory
For anything that needs to survive a session, you write it to an external store. The simplest version is a key-value table:
class SessionMemory:
def __init__(self, db, session_id):
self.db = db
self.session_id = session_id
def get(self, key: str) -> str | None:
return self.db.get(f"{self.session_id}:{key}")
def set(self, key: str, value: str) -> None:
self.db.set(f"{self.session_id}:{key}", value)
def all(self) -> dict[str, str]:
return self.db.scan(f"{self.session_id}:*")
You then expose remember(key, value) and recall(key) as tools the agent can call. The model decides what to write and when to read. At the start of each turn, you can optionally include all session memory in the system prompt so the agent has it available without having to call recall.
Structured memory is good for facts: the user's name, the current task, intermediate results, configuration choices. It is bad for unstructured information ("remember that I prefer Python") because the model has to phrase a key correctly to retrieve it later.
The threat surface: anything written to memory came from somewhere. If the writer was the model in response to user input, then the user can write arbitrary content into memory by phrasing their input cleverly. If your agent later trusts memory contents as facts, you have a stored prompt injection vector.
A defensive pattern: namespace memory by source. Memory written from a verified system action gets one prefix. Memory written from a model decision based on user input gets another. The trust level of memory depends on its prefix, and you treat it accordingly when re-injecting it into context.
Layer 3: Semantic long-term memory
For unstructured, large-scale memory, you use a vector database. The pattern is:
- Convert text to a vector using an embedding model.
- Store the vector along with the original text in a database.
- At query time, embed the query and find the most similar stored vectors.
- Return the corresponding text and stuff it into the model's context.
This is Retrieval-Augmented Generation (RAG), and we will cover it in depth in lesson 3.3. For now, the key insight is that semantic memory lets you store far more than fits in a context window and retrieve only what is relevant to the current query.
A minimal ChromaDB example:
import chromadb
from chromadb.utils import embedding_functions
client = chromadb.Client()
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
collection = client.create_collection(
name="agent_memory",
embedding_function=embedder,
)
# Store a memory
collection.add(
documents=["The user prefers Python over JavaScript."],
metadatas=[{"session": "abc123", "source": "user_statement"}],
ids=["mem_001"],
)
# Retrieve by similarity
results = collection.query(
query_texts=["What programming language should I use?"],
n_results=3,
where={"session": "abc123"},
)
print(results["documents"])
The embedding model turns each text into a vector of, say, 384 floats. Similarity search uses cosine distance to find the closest stored vectors. The model all-MiniLM-L6-v2 is small, fast, and runs locally; for production you might use OpenAI's text-embedding-3-small for better quality at the cost of an API call.
Three things to understand about embeddings before you trust them with anything important.
First, embeddings are lossy. Two sentences with similar meaning produce similar vectors, but the round-trip from text to vector to retrieved text loses nuance. If you embed legal contract language, your retrieval will surface "similar enough" passages, not exact matches. Combine semantic search with keyword search (a hybrid retrieval) when precision matters.
Second, embedding models have biases and capability gaps. They were trained on a specific corpus. They will be much better at English than at low-resource languages, and they will sometimes cluster things that should not cluster (homonyms, idioms, technical jargon outside their training).
Third, similarity does not mean relevance. The top-3 nearest neighbours might all be irrelevant if there is nothing relevant in your corpus. A reranker (a small classifier that reads the query plus each candidate and scores them) is the standard fix.
The threat profile of each layer
| Layer | Writer | Reader | Persists | Worst case |
|---|---|---|---|---|
| In-context | You + user + tools | The model | Session | Context window injection |
| Structured (key-value) | Model (via tool) | The model | Indefinite | Stored injection within a session |
| Semantic (vector DB) | Whoever populates the corpus | The model | Indefinite | Corpus poisoning across sessions |
Notice the trend. Each deeper layer has more writers, more readers, and longer persistence. The worst-case attack also gets worse: a context injection affects one turn, a session-memory injection affects a session, a corpus poisoning affects every session that retrieves the poisoned content.
This is the central security insight for agent memory: the blast radius scales with the layer. Defend each layer in proportion to its scope.
Memory write and read patterns
For each layer, a defensive baseline.
In-context. Treat everything in context as data, not instruction. Use prompt structures that delimit data clearly: <document>...</document>, <user_input>...</user_input>. Tell the model in the system prompt that anything inside those tags is user-controlled and should not be followed as instructions. This is imperfect (it does not stop a determined injection) but it is the cheap baseline.
Structured memory. Namespace by trust. Memory you wrote (system actions, verified facts) gets one namespace. Memory the model wrote based on user input gets another. When re-injecting into context, surface the source. The model can be told: "the following facts came from the user's verified profile" versus "the following facts came from the current conversation."
Semantic memory. Tightly control who writes to the corpus. If users can submit content that gets indexed, treat retrieved chunks as user input, not as authoritative documents. Consider running a classifier on retrieved chunks before passing them to the main model, looking for injection patterns or content that mismatches the original query intent.
Forgetting matters
Most agent memory designs focus on remembering. The harder design question is forgetting. Three reasons you need explicit forgetting:
- Privacy. Users have the right to delete their data. Your memory layer needs a "forget user X" operation that works correctly across all three layers.
- Drift. Old facts go stale. If the agent remembers a configuration choice from six months ago and the configuration has changed, the agent will act on bad information.
- Adversarial cleanup. If an attacker poisons memory, you need a way to identify and purge the poisoned entries. This requires the kind of provenance metadata we discussed.
A common production pattern: every memory entry has a TTL (time-to-live), a source, and a confidence score. Low-confidence entries expire faster. Memory written by tools that have a higher injection risk gets a shorter TTL. The system prunes on a schedule and on explicit delete requests.
Where this leaves us
Working memory is the cheapest, fastest, and least dangerous. Use it for everything that fits. Add structured memory when you need to persist facts across iterations or sessions, with namespacing for trust. Add semantic memory when you have a corpus too large to fit and you can control who writes to it.
The next lesson covers tool use in depth, which is the other half of the agent equation. Memory is how the agent knows things. Tools are how the agent does things. Both are how attackers reach the agent.