Lesson 14 of 20 · Track 1

Memory and context engineering

Conversation memory

Short-term context window management.

Memory is just a list

When people say an agent has "memory," they usually mean something more elaborate than what's actually happening. The truth is simple: memory is the list of messages you send to the model on each turn. Anything in that list is in memory. Anything not in it is forgotten.

That's it. The whole notion of conversational memory comes down to a messages list and what you choose to put in it.

This sounds underwhelming until you realize the implications. Every architectural choice about memory comes down to "what goes in the list, in what order, and when does it leave."

The simplest memory: append everything

The most basic agent stores every message verbatim and sends the whole list on every turn:

import ollama

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]
 
while True:
    user_input = input("> ")
    messages.append({"role": "user", "content": user_input})
 
    response = ollama.chat(model="llama3", messages=messages)
    messages.append(response.message)
 
    print(response.message.content)

This works perfectly for the first dozen turns. The model has full context; nothing is lost. It also fails predictably as the conversation grows, because every model has a fixed context window.

Why naive memory breaks

Three concrete problems show up as the conversation grows:

1. The context window fills up

Every model has a hard limit: Llama 3 ships with an 8K-token window, GPT-4o with 128K, Claude Sonnet with 200K. Once you exceed it, the API errors out or silently truncates.

# What eventually happens (rough heuristic: ~4 characters per token)
def estimate_tokens(message):
    return len(message["content"]) // 4

MODEL_CONTEXT_WINDOW = 8192  # e.g. Llama 3
total_tokens = sum(estimate_tokens(m) for m in messages)
if total_tokens > MODEL_CONTEXT_WINDOW:
    raise RuntimeError("Context window exceeded: too many messages")

2. Cost grows linearly per turn

Most APIs charge per input token. If you send all 50 turns of a conversation on turn 51, you pay for all 50 turns again. Then 51 turns on turn 52. The total cost of a long conversation grows quadratically in the number of turns.
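A quick back-of-the-envelope sketch makes the growth concrete (the per-message token count and messages-per-turn figures are assumptions for illustration):

# Assumption for illustration: ~200 tokens per message, 2 messages per turn
TOKENS_PER_MESSAGE = 200
MESSAGES_PER_TURN = 2  # one user message + one assistant reply

# Turn n resends the whole history, so its input size grows linearly with n...
input_tokens = [n * MESSAGES_PER_TURN * TOKENS_PER_MESSAGE for n in range(1, 51)]
print(input_tokens[-1])   # 20000 tokens sent on turn 50 alone

# ...and the cumulative bill grows quadratically
print(sum(input_tokens))  # 510000 tokens billed across 50 turns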

3. The model gets distracted

Even within the context window, more text means worse focus. Models exhibit "lost in the middle" behavior: information buried in the middle of a long context is harder for the model to use than information at the beginning or end. A conversation with 80 messages has 70+ messages in the middle.

The first real fix: scoped messages

The simplest improvement is to scope what goes in the list. Not every message needs to be there.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]
 
def add_user(text):
    messages.append({"role": "user", "content": text})
 
def add_assistant(text):
    messages.append({"role": "assistant", "content": text})

This is fine. But once tools enter the picture, you add tool-call messages and tool-result messages too, and the volume goes up dramatically. A single user question might produce 10 messages of internal tool chatter. Those are useful for one turn; they're noise after the answer is delivered.

A common pattern is to compact internal turns once a task finishes:

def finalize_turn(messages, user_question, final_answer):
    """Replace tool-call scratch with the question and final answer only."""
    # Take the latest occurrence of the question, so a repeated
    # question can't match an earlier turn by mistake
    boundary = max(
        i for i, m in enumerate(messages)
        if m["role"] == "user" and m["content"] == user_question
    )
    return messages[:boundary] + [
        {"role": "user", "content": user_question},
        {"role": "assistant", "content": final_answer},
    ]

The user sees a clean conversation. Internal tool calls are erased once they served their purpose. The next turn doesn't have to drag the previous turn's reasoning chain along.
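Here's a hypothetical before-and-after to make the compaction concrete; the tool-call message shape below is illustrative rather than any specific API's:

# Hypothetical turn: tool scratch sits between the question and the answer
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Oslo?"},
    {"role": "assistant", "content": "", "tool_calls": [{"name": "get_weather", "arguments": {"city": "Oslo"}}]},
    {"role": "tool", "content": '{"temp_c": 4, "sky": "overcast"}'},
]

messages = finalize_turn(messages, "What's the weather in Oslo?", "About 4°C and overcast in Oslo.")
print([m["role"] for m in messages])  # ['system', 'user', 'assistant']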

Memory boundaries

A useful frame: every agent has at least three memory scopes.

Scope                  Lifetime          What it holds
Inner-loop state       One task          Tool calls, observations, intermediate reasoning
Conversation memory    One session       User messages, final agent answers
Persistent memory      Across sessions   Facts, preferences, prior projects (covered two lessons from now)

The mistake beginners make is using one big list for all three. The fix is to keep them separate and only promote information from a smaller scope to a larger one when it earns its way there.

A working memory layout

Here's a concrete sketch of how this looks in code:

class Conversation:
    def __init__(self, system_prompt):
        self.system_prompt = system_prompt
        self.history = []  # public conversation: user + final answers only
 
    def turn(self, user_input, tools, registry):
        # Build the per-task message list (inner-loop scope)
        task_messages = (
            [{"role": "system", "content": self.system_prompt}]
            + self.history
            + [{"role": "user", "content": user_input}]
        )
 
        # Run the inner loop with full tool-call detail
        result = inner_loop(task_messages, tools, registry)
 
        # Promote only the question and the final answer to history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": result["answer"]})
 
        return result["answer"]

The inner loop sees the system prompt, conversation history, and the new question. It generates a long internal trace as it works. None of that trace pollutes the conversation history. Only the question and the final answer are remembered.
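To watch the promotion step in isolation, here's a hypothetical run with inner_loop stubbed out (the real inner loop is the tool-calling loop from the earlier lessons):

# Stub for illustration: pretend the loop produced pages of tool scratch
def inner_loop(task_messages, tools, registry):
    return {"answer": "Paris has about 2.1 million residents."}

conv = Conversation("You are a helpful assistant.")
print(conv.turn("What's the population of Paris?", tools=[], registry={}))
print(len(conv.history))  # 2: just the question and the final answer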

This single design choice can cut token usage by 5 to 10x on multi-turn agents that use tools.
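The arithmetic behind that range, using hypothetical per-turn message counts:

# Hypothetical: each turn generates ~10 messages of tool scratch,
# of which only 2 (question + final answer) are worth keeping
SCRATCH_PER_TURN, KEPT_PER_TURN = 10, 2
turns = range(1, 21)
naive = sum(n * SCRATCH_PER_TURN for n in turns)   # resend all scratch every turn
scoped = sum(n * KEPT_PER_TURN for n in turns)     # resend compacted history only
print(naive // scoped)  # 5x fewer messages resent over 20 turns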

The 'agent has memory' marketing

When a product says "our agent has memory," it almost always means one of three things: (1) the agent keeps conversation history (this lesson), (2) it summarizes old turns into a compact form (next lesson), or (3) it stores extracted facts in a vector store (the lesson after that). All three are layered on top of the same messages list. There's no magic.

Key takeaway

Memory is a list of messages. Everything else is a strategy for what to put in that list. The first practical move is to separate inner-loop scratch (which you discard) from conversation history (which you keep). The next lessons add two more strategies: summarizing old turns and retrieving relevant facts on demand. Together they let you run multi-hour agent sessions without blowing the context window.
