Memory and context engineering
Long-term memory with vector stores
Retrieval, embeddings, and when to remember vs forget.
Memory that survives the session
Sliding window and summary memory both live inside a single session. Close the conversation and the memory is gone. Real users expect agents to remember things across days, weeks, and months: their preferences, prior projects, recurring problems.
This is what people mean when they say an agent has "long-term memory." It's almost always implemented the same way: turn observations into vector embeddings, store them in a database, and retrieve the relevant ones at the start of each new turn.
Embeddings in one paragraph
An embedding is a vector (a list of numbers, typically 384 to 1536 dimensions) that represents the meaning of a piece of text. Texts with similar meanings produce similar vectors. You can measure "similar" with cosine similarity.
from ollama import embeddings
vec = embeddings(model="nomic-embed-text", prompt="The user prefers dark mode.")["embedding"]
# vec is a list of 768 floats
Two key properties:
- Semantic, not lexical. "user prefers dark mode" and "she likes dark themes" have similar embeddings even though they share almost no words.
- Cheap to compute and compare. You can embed thousands of facts and search them in milliseconds.
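Here's a quick way to see the first property for yourself; a minimal sketch using the same embedding model (exact scores will vary by model version):

```python
import ollama
from numpy import dot
from numpy.linalg import norm

def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a, b):
    # cosine similarity: close to 1.0 means same meaning, near 0 means unrelated
    return dot(a, b) / (norm(a) * norm(b))

a = embed("user prefers dark mode")
b = embed("she likes dark themes")
c = embed("the build failed on CI again")

print(cosine(a, b))  # high: same meaning despite almost no shared words
print(cosine(a, c))  # noticeably lower: unrelated meaning
```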
That's enough to build memory.
The flow
A long-term memory system has four operations:
- Write. When something worth remembering happens, embed it and store the embedding plus the original text.
- Search. When the user asks a new question, embed the question and find the K most similar stored facts.
- Inject. Prepend those K facts into the message list before calling the model.
- Update. Optionally, edit or delete facts when the agent learns they're outdated.
Here's the minimum viable version:
import json, ollama
from numpy import dot
from numpy.linalg import norm
class VectorMemory:
    def __init__(self):
        self.facts = []  # list of {"text": str, "embedding": list[float]}

    def _embed(self, text):
        return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

    def write(self, text):
        self.facts.append({"text": text, "embedding": self._embed(text)})

    def search(self, query, k=3):
        if not self.facts:
            return []
        q = self._embed(query)
        scored = [
            (dot(q, f["embedding"]) / (norm(q) * norm(f["embedding"])), f["text"])
            for f in self.facts
        ]
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]

For real use you'd swap the in-memory list for a vector database: SQLite + sqlite-vec for local prototyping, or Postgres + pgvector, Qdrant, or Chroma for production. The interface is the same; the storage is what changes.
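As one concrete version of that swap, here's a sketch of the same write/search interface backed by Chroma (the collection name, ID scheme, and persistence path are illustrative choices, not requirements):

```python
import chromadb
import ollama

class ChromaMemory:
    """Same interface as VectorMemory, but facts persist across processes."""

    def __init__(self, path="./memory"):
        self.client = chromadb.PersistentClient(path=path)
        self.collection = self.client.get_or_create_collection("facts")

    def _embed(self, text):
        return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

    def write(self, text):
        self.collection.add(
            ids=[str(self.collection.count())],  # naive ID scheme, fine for a sketch
            documents=[text],
            embeddings=[self._embed(text)],
        )

    def search(self, query, k=3):
        count = self.collection.count()
        if count == 0:
            return []
        results = self.collection.query(
            query_embeddings=[self._embed(query)],
            n_results=min(k, count),
        )
        return results["documents"][0]
```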
Wiring it into the agent
Two integration points: write after a turn, search before a turn.
class AgentWithMemory:
    def __init__(self, system_prompt, tools, registry):
        self.system = system_prompt
        self.tools = tools
        self.registry = registry
        self.memory = VectorMemory()

    def turn(self, user_input):
        # SEARCH: pull relevant facts before reasoning
        relevant = self.memory.search(user_input, k=3)
        memory_msg = ""
        if relevant:
            memory_msg = "Relevant facts you remember:\n" + "\n".join(f"- {f}" for f in relevant)
        messages = [
            {"role": "system", "content": self.system + "\n\n" + memory_msg},
            {"role": "user", "content": user_input},
        ]
        result = inner_loop(messages, self.tools, self.registry)
        # WRITE: extract and store anything memory-worthy
        self._maybe_remember(user_input, result["answer"])
        return result["answer"]

    def _maybe_remember(self, question, answer):
        # Ask the model what (if anything) is worth remembering
        extracted = ollama.chat(
            model="llama3",
            messages=[{
                "role": "system",
                "content": "Extract any durable facts about the user from this exchange. Output one fact per line, or 'none' if there are no durable facts.",
            }, {
                "role": "user",
                "content": f"Q: {question}\nA: {answer}",
            }],
        ).message.content
        for line in extracted.splitlines():
            line = line.strip("- ").strip()
            if line and line.lower() != "none":
                self.memory.write(line)

That's a working long-term memory loop in 50 lines.
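To make the loop concrete, here's a quick usage sketch (the tools, registry, and inner_loop come from the earlier lessons; an empty tool set is enough to see the memory behavior):

```python
agent = AgentWithMemory(
    system_prompt="You are a helpful assistant.",
    tools=[],      # no tools needed to demonstrate memory
    registry={},
)

agent.turn("Hi, I'm Ada. I prefer SQLite over Postgres for side projects.")
# _maybe_remember should extract facts along the lines of:
#   "User's name is Ada"
#   "User prefers SQLite over Postgres for side projects"

print(agent.turn("Which database should I use for my new side project?"))
# The search step retrieves the SQLite preference and injects it into the
# system prompt, so the answer should lean toward SQLite.
```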
What to remember vs what to skip
The single biggest failure mode of vector memory is storing too much. If you store every turn, retrieval becomes noisy. Search returns ten irrelevant facts mixed with one relevant one, and the model gets confused.
A useful filter: only store facts that meet one of these criteria.
| Worth remembering | Skip |
|---|---|
| User preferences ("prefers Vim", "uses Postgres") | One-off questions ("what's the weather?") |
| Personal details ("name is Ada, works at Acme") | Reasoning steps inside a task |
| Recurring problems ("auth bug in staging") | Greetings, pleasantries |
| Decisions made ("decided to use Stripe over Paddle") | Retrieved facts (you'll re-retrieve them) |
The extraction prompt above does this implicitly by asking for "durable facts." Be more aggressive in production: filter the extracted facts again before writing.
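One way to do that second pass is a per-fact yes/no check before writing; a minimal sketch (the prompt wording is illustrative):

```python
def worth_remembering(fact, model="llama3"):
    """Second-pass filter: keep only durable, user-specific facts."""
    verdict = ollama.chat(
        model=model,
        messages=[{
            "role": "system",
            "content": (
                "Answer 'yes' or 'no'. Is this a durable fact about the user "
                "(a preference, personal detail, recurring problem, or decision), "
                "rather than a one-off question, greeting, or reasoning step?"
            ),
        }, {"role": "user", "content": fact}],
    ).message.content
    return verdict.strip().lower().startswith("yes")


# In _maybe_remember, gate each extracted line:
#     if line and line.lower() != "none" and worth_remembering(line):
#         self.memory.write(line)
```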
When retrieval goes wrong
Three common failure modes:
1. Stale facts
The user said "I'm using Postgres" six months ago. Now they're using SQLite. The memory still returns the Postgres fact, and the agent gives advice for the wrong database.
Fix: timestamp every fact and either decay relevance over time or actively delete superseded facts. The smart version asks the model: "given this new information, are any existing facts now wrong?" and edits accordingly.
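A minimal sketch of the timestamp-and-decay idea, built on the VectorMemory class above (the 30-day half-life is an arbitrary choice; tune it for your domain):

```python
import time

from numpy import dot
from numpy.linalg import norm

class TimestampedMemory(VectorMemory):
    HALF_LIFE = 30 * 24 * 3600  # relevance halves every 30 days (arbitrary)

    def write(self, text):
        self.facts.append({
            "text": text,
            "embedding": self._embed(text),
            "written_at": time.time(),
        })

    def search(self, query, k=3):
        if not self.facts:
            return []
        q = self._embed(query)
        now = time.time()
        scored = []
        for f in self.facts:
            similarity = dot(q, f["embedding"]) / (norm(q) * norm(f["embedding"]))
            decay = 0.5 ** ((now - f["written_at"]) / self.HALF_LIFE)
            scored.append((similarity * decay, f["text"]))
        scored.sort(reverse=True)
        return [text for _, text in scored[:k]]
```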
2. Conflicting facts
Two facts contradict each other. The retriever returns both, and the model picks one arbitrarily.
Fix: when you write a new fact, search for similar existing facts first. If you find conflicts, either merge them, mark old ones as stale, or store the conflict explicitly so the model knows.
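Sketched against the classes above, a conflict-aware write might look like this (the yes/no prompt and the delete-the-old-fact policy are one possible choice among several):

```python
def write_with_conflict_check(memory, new_fact, model="llama3"):
    """Before storing a fact, check whether it contradicts a similar existing one."""
    for old_fact in memory.search(new_fact, k=3):
        verdict = ollama.chat(
            model=model,
            messages=[{
                "role": "system",
                "content": "Do these two facts contradict each other? Answer 'yes' or 'no'.",
            }, {
                "role": "user",
                "content": f"Existing fact: {old_fact}\nNew fact: {new_fact}",
            }],
        ).message.content
        if verdict.strip().lower().startswith("yes"):
            # Simplest policy: treat the old fact as superseded and drop it.
            memory.facts = [f for f in memory.facts if f["text"] != old_fact]
    memory.write(new_fact)
```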
3. Retrieval misses the point
The user asks "what was that database we settled on?" but the stored fact says "decided to use Stripe over Paddle for payments." The semantic search will return the payment fact because "settled on" matches "decided to use."
Fix: retrieve more, then have the model filter. Pull the top 10, then ask the model "which of these are relevant to the current question?" before injecting them. This costs an extra LLM call but prevents irrelevant memory pollution.
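A sketch of that retrieve-then-filter step, reusing the pieces above (the prompt and the top-10 cutoff are illustrative):

```python
def retrieve_and_filter(memory, question, model="llama3"):
    """Pull a wide net from memory, then let the model keep only what's relevant."""
    candidates = memory.search(question, k=10)
    if not candidates:
        return []
    numbered = "\n".join(f"{i}. {fact}" for i, fact in enumerate(candidates))
    reply = ollama.chat(
        model=model,
        messages=[{
            "role": "system",
            "content": "List the numbers of the facts relevant to the question, comma-separated, or 'none'.",
        }, {
            "role": "user",
            "content": f"Question: {question}\n\nFacts:\n{numbered}",
        }],
    ).message.content
    keep = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    return [fact for i, fact in enumerate(candidates) if i in keep]
```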
Memory vs RAG
Long-term memory and RAG (retrieval-augmented generation) are the same mechanism applied to different content:
| RAG | Long-term memory |
|---|---|
| Retrieves from documents | Retrieves from past conversations |
| Source: PDFs, wikis, code | Source: extracted facts |
| Updated by ingestion | Updated by interaction |
| Read-mostly | Read-write |
The plumbing is identical: embed query, find nearest neighbors, inject into context. Module 6 covers RAG in depth, including the agent-driven flavor where the agent itself decides when to retrieve.
Start without a vector database
For your first agent with memory, use the in-memory list above. You'll learn more from making the retrieval logic work well than from setting up a vector DB. When you have 10,000+ facts or need cross-process persistence, then graduate to a real database.
Key takeaway
Long-term memory is selective storage and retrieval of facts via embeddings. The hard parts are not the database; they're the policies: what to store, when to update, how to filter retrieval. Get those right and your agent appears to remember the user. Get them wrong and you ship an agent that confidently regurgitates outdated facts. The next lesson takes up the higher-level question that all three memory strategies have to answer: given a finite context window, what makes the cut?