Memory and context engineering
Long-term memory with vector stores
Retrieval, embeddings, and when to remember vs forget.
Memory that survives the session
Sliding window and summary memory both live inside a single session. Close the conversation and the memory is gone. Real users expect agents to remember things across days, weeks, and months: their preferences, prior projects, recurring problems.
This is what people mean when they say an agent has "long-term memory." It's almost always implemented the same way: turn observations into vector embeddings, store them in a database, and retrieve the relevant ones at the start of each new turn.
Embeddings in one paragraph
An embedding is a vector (a list of numbers, typically 384 to 1536 dimensions) that represents the meaning of a piece of text. Texts with similar meanings produce similar vectors. You can measure "similar" with cosine similarity.
from ollama import embeddings
vec = embeddings(model="nomic-embed-text", prompt="The user prefers dark mode.")["embedding"]
# vec is a list of 768 floats
Two key properties:
- Semantic, not lexical. "user prefers dark mode" and "she likes dark themes" have similar embeddings even though they share almost no words.
- Cheap to compute and compare. You can embed thousands of facts and search them in milliseconds.
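Here's a quick way to see the first property for yourself; a minimal sketch using the same embedding model (exact scores will vary by model version):

```python
import ollama
from numpy import dot
from numpy.linalg import norm

def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a, b):
    # cosine similarity: close to 1.0 means same meaning, near 0 means unrelated
    return dot(a, b) / (norm(a) * norm(b))

a = embed("user prefers dark mode")
b = embed("she likes dark themes")
c = embed("the build failed on CI again")

print(cosine(a, b))  # high: same meaning despite almost no shared words
print(cosine(a, c))  # noticeably lower: unrelated meaning
```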
That's enough to build memory.
The flow
A long-term memory system has four operations:
- Write. When something worth remembering happens, embed it and store the embedding plus the original text.
- Search. When the user asks a new question, embed the question and find the K most similar stored facts.
- Inject. Prepend those K facts into the message list before calling the model.
- Update. Optionally, edit or delete facts when the agent learns they're outdated.
Here's the minimum viable version:
import json, ollama
from numpy import dot
from numpy.linalg import norm
class VectorMemory:
    def __init__(self):
        self.facts = []  # list of {"text": str, "embedding": list[float]}

    def _embed(self, text):
        return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

    def write(self, text):
        self.facts.append({"text": text, "embedding": self._embed(text)})

    def search(self, query, k=3):
        if not self.facts:
            return []
        q = self._embed(query)
        scored = [
            (dot(q, f["embedding"]) / (norm(q) * norm(f["embedding"])), f["text"])
            for f in self.facts
        ]
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]

For real use you'd swap the in-memory list for a vector database: SQLite + sqlite-vec for local prototyping, or Postgres + pgvector, Qdrant, or Chroma for production. The interface is the same; the storage is what changes.
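As one concrete version of that swap, here's a sketch of the same write/search interface backed by Chroma (the collection name, ID scheme, and persistence path are illustrative choices, not requirements):

```python
import chromadb
import ollama

class ChromaMemory:
    """Same interface as VectorMemory, but facts persist across processes."""

    def __init__(self, path="./memory"):
        self.client = chromadb.PersistentClient(path=path)
        self.collection = self.client.get_or_create_collection("facts")

    def _embed(self, text):
        return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

    def write(self, text):
        self.collection.add(
            ids=[str(self.collection.count())],  # naive ID scheme, fine for a sketch
            documents=[text],
            embeddings=[self._embed(text)],
        )

    def search(self, query, k=3):
        count = self.collection.count()
        if count == 0:
            return []
        results = self.collection.query(
            query_embeddings=[self._embed(query)],
            n_results=min(k, count),
        )
        return results["documents"][0]
```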
Wiring it into the agent
Two integration points: write after a turn, search before a turn.
class AgentWithMemory:
    def __init__(self, system_prompt, tools, registry):
        self.system = system_prompt
        self.tools = tools
        self.registry = registry
        self.memory = VectorMemory()

    def turn(self, user_input):
        # SEARCH: pull relevant facts before reasoning
        relevant = self.memory.search(user_input, k=3)
        memory_msg = ""
        if relevant:
            memory_msg = "Relevant facts you remember:\n" + "\n".join(f"- {f}" for f in relevant)
        messages = [
            {"role": "system", "content": self.system + "\n\n" + memory_msg},
            {"role": "user", "content": user_input},
        ]
        result = inner_loop(messages, self.tools, self.registry)
        # WRITE: extract and store anything memory-worthy
        self._maybe_remember(user_input, result["answer"])
        return result["answer"]

    def _maybe_remember(self, question, answer):
        # Ask the model what (if anything) is worth remembering
        extracted = ollama.chat(
            model="llama3",
            messages=[{
                "role": "system",
                "content": "Extract any durable facts about the user from this exchange. Output one fact per line, or 'none' if there are no durable facts.",
            }, {
                "role": "user",
                "content": f"Q: {question}\nA: {answer}",
            }],
        ).message.content
        for line in extracted.splitlines():
            line = line.strip("- ").strip()
            if line and line.lower() != "none":
                self.memory.write(line)

That's a working long-term memory loop in 50 lines.
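To make the loop concrete, here's a quick usage sketch (the tools, registry, and inner_loop come from the earlier lessons; an empty tool set is enough to see the memory behavior):

```python
agent = AgentWithMemory(
    system_prompt="You are a helpful assistant.",
    tools=[],      # no tools needed to demonstrate memory
    registry={},
)

agent.turn("Hi, I'm Ada. I prefer SQLite over Postgres for side projects.")
# _maybe_remember should extract facts along the lines of:
#   "User's name is Ada"
#   "User prefers SQLite over Postgres for side projects"

print(agent.turn("Which database should I use for my new side project?"))
# The search step retrieves the SQLite preference and injects it into the
# system prompt, so the answer should lean toward SQLite.
```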
What to remember vs what to skip
The single biggest failure mode of vector memory is storing too much. If you store every turn, retrieval becomes noisy. Search returns ten irrelevant facts mixed with one relevant one, and the model gets confused.
A useful filter: only store facts that meet one of these criteria.
| Worth remembering | Skip |
|---|---|
| User preferences ("prefers Vim", "uses Postgres") | One-off questions ("what's the weather?") |
| Personal details ("name is Ada, works at Acme") | Reasoning steps inside a task |
| Recurring problems ("auth bug in staging") | Greetings, pleasantries |
| Decisions made ("decided to use Stripe over Paddle") | Retrieved facts (you'll re-retrieve them) |
The extraction prompt above does this implicitly by asking for "durable facts." Be more aggressive in production: filter the extracted facts again before writing.
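One way to do that second pass is a per-fact yes/no check before writing; a minimal sketch (the prompt wording is illustrative):

```python
def worth_remembering(fact, model="llama3"):
    """Second-pass filter: keep only durable, user-specific facts."""
    verdict = ollama.chat(
        model=model,
        messages=[{
            "role": "system",
            "content": (
                "Answer 'yes' or 'no'. Is this a durable fact about the user "
                "(a preference, personal detail, recurring problem, or decision), "
                "rather than a one-off question, greeting, or reasoning step?"
            ),
        }, {"role": "user", "content": fact}],
    ).message.content
    return verdict.strip().lower().startswith("yes")


# In _maybe_remember, gate each extracted line:
#     if line and line.lower() != "none" and worth_remembering(line):
#         self.memory.write(line)
```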
When retrieval goes wrong
Three common failure modes:
1. Stale facts
The user said "I'm using Postgres" six months ago. Now they're using SQLite. The memory still returns the Postgres fact, and the agent gives advice for the wrong database.
Fix: timestamp every fact and either decay relevance over time or actively delete superseded facts. The smart version asks the model: "given this new information, are any existing facts now wrong?" and edits accordingly.
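A minimal sketch of the timestamp-and-decay idea, built on the VectorMemory class above (the 30-day half-life is an arbitrary choice; tune it for your domain):

```python
import time

from numpy import dot
from numpy.linalg import norm

class TimestampedMemory(VectorMemory):
    HALF_LIFE = 30 * 24 * 3600  # relevance halves every 30 days (arbitrary)

    def write(self, text):
        self.facts.append({
            "text": text,
            "embedding": self._embed(text),
            "written_at": time.time(),
        })

    def search(self, query, k=3):
        if not self.facts:
            return []
        q = self._embed(query)
        now = time.time()
        scored = []
        for f in self.facts:
            similarity = dot(q, f["embedding"]) / (norm(q) * norm(f["embedding"]))
            decay = 0.5 ** ((now - f["written_at"]) / self.HALF_LIFE)
            scored.append((similarity * decay, f["text"]))
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]
```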
2. Conflicting facts
Two facts contradict each other. The retriever returns both, and the model picks one arbitrarily.
Fix: when you write a new fact, search for similar existing facts first. If you find conflicts, either merge them, mark old ones as stale, or store the conflict explicitly so the model knows.
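Sketched against the classes above, a conflict-aware write might look like this (the yes/no prompt and the delete-the-old-fact policy are one possible choice among several):

```python
def write_with_conflict_check(memory, new_fact, model="llama3"):
    """Before storing a fact, check whether it contradicts a similar existing one."""
    for old_fact in memory.search(new_fact, k=3):
        verdict = ollama.chat(
            model=model,
            messages=[{
                "role": "system",
                "content": "Do these two facts contradict each other? Answer 'yes' or 'no'.",
            }, {
                "role": "user",
                "content": f"Existing fact: {old_fact}\nNew fact: {new_fact}",
            }],
        ).message.content
        if verdict.strip().lower().startswith("yes"):
            # Simplest policy: treat the old fact as superseded and drop it.
            memory.facts = [f for f in memory.facts if f["text"] != old_fact]
    memory.write(new_fact)
```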
3. Retrieval misses the point
The user asks "what was that database we settled on?" but the stored fact says "decided to use Stripe over Paddle for payments." The semantic search will return the payment fact because "settled on" matches "decided to use."
Fix: retrieve more, then have the model filter. Pull the top 10, then ask the model "which of these are relevant to the current question?" before injecting them. This costs an extra LLM call but prevents irrelevant memory pollution.
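A sketch of that retrieve-then-filter step, reusing the pieces above (the prompt and the top-10 cutoff are illustrative):

```python
def retrieve_and_filter(memory, question, model="llama3"):
    """Pull a wide net from memory, then let the model keep only what's relevant."""
    candidates = memory.search(question, k=10)
    if not candidates:
        return []
    numbered = "\n".join(f"{i}. {fact}" for i, fact in enumerate(candidates))
    reply = ollama.chat(
        model=model,
        messages=[{
            "role": "system",
            "content": "List the numbers of the facts relevant to the question, comma-separated, or 'none'.",
        }, {
            "role": "user",
            "content": f"Question: {question}\n\nFacts:\n{numbered}",
        }],
    ).message.content
    keep = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    return [fact for i, fact in enumerate(candidates) if i in keep]
```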
Memory vs RAG
Long-term memory and RAG (retrieval-augmented generation) are the same mechanism applied to different content:
| RAG | Long-term memory |
|---|---|
| Retrieves from documents | Retrieves from past conversations |
| Source: PDFs, wikis, code | Source: extracted facts |
| Updated by ingestion | Updated by interaction |
| Read-mostly | Read-write |
The plumbing is identical: embed query, find nearest neighbors, inject into context. Module 6 covers RAG in depth, including the agent-driven flavor where the agent itself decides when to retrieve.
Start without a vector database
For your first agent with memory, use the in-memory list above. You'll learn more from making the retrieval logic work well than from setting up a vector DB. When you have 10,000+ facts or need cross-process persistence, then graduate to a real database.
Key takeaway
Long-term memory is selective storage and retrieval of facts via embeddings. The hard parts are not the database; they're the policies: what to store, when to update, how to filter retrieval. Get those right and your agent appears to remember the user. Get them wrong and you ship an agent that confidently regurgitates outdated facts. The next lesson takes up the higher-level question that all three memory strategies have to answer: given a finite context window, what makes the cut?