Lesson 15 of 20 · Track 1

Memory and context engineering

Summary memory and sliding window

Strategies for keeping context manageable.

Interactive exercise ~10 min

When append-everything stops working

Even with proper scoping, a conversation that runs long enough will eventually outgrow the context window. A 50-turn customer support session, a multi-hour coding session, a multi-day research project: these all produce more messages than any model can hold.

The two classic strategies for fitting an arbitrarily long conversation into a fixed context window are:

  • Sliding window. Keep the last N turns, drop everything older.
  • Summary memory. Compress old turns into a short summary, keep the recent turns verbatim.

Most production systems use both, layered.

Sliding window: drop the old stuff

The simplest possible memory bound:

def trim_messages(messages, max_recent=20):
    """Keep system messages plus the last max_recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_recent:]

Send the system prompt plus the last 20 messages, drop the rest. This works fine for casual chat but has an obvious problem: the model genuinely forgets anything older than the window. If the user mentioned their name 30 turns ago, it's gone.

Sliding window is a good default for stateless tasks (a one-off coding question doesn't care what you talked about an hour ago) and a bad default for relational tasks (a customer support agent needs to remember the original complaint).
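A common variant trims by estimated tokens rather than message count, since messages vary wildly in length. A minimal sketch, using a rough 4-characters-per-token heuristic (an approximation, not a real tokenizer):

```python
def trim_messages_by_tokens(messages, max_tokens=4000):
    """Keep system messages plus as many recent messages as fit the budget."""
    def estimate(m):
        # ~4 characters per token for English text, plus a small per-message
        # overhead; swap in a real tokenizer when accuracy matters
        return len(m["content"]) // 4 + 4

    system = [m for m in messages if m["role"] == "system"]
    budget = max_tokens - sum(estimate(m) for m in system)

    # walk backwards from the newest message, keeping whatever fits
    recent = []
    for m in reversed([m for m in messages if m["role"] != "system"]):
        cost = estimate(m)
        if cost > budget:
            break
        recent.append(m)
        budget -= cost
    return system + recent[::-1]
```

This keeps the window stable in cost terms: one very long message displaces several short ones instead of blowing the budget.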

Summary memory: compress, don't drop

Summary memory replaces old turns with a single summary message. The summary is generated by the model, asynchronously or on demand:

import ollama

SUMMARY_PROMPT = """Summarize the following conversation between a user and an assistant.
Keep all important facts, decisions, and unresolved questions.
Drop pleasantries, repeated info, and tool-call scratch work.
Be concise but complete. Maximum 200 words."""

def format_messages(messages):
    """Flatten the message list into plain text for the summarizer."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def summarize(messages):
    response = ollama.chat(
        model="llama3",
        messages=[
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": format_messages(messages)},
        ],
    )
    return response.message.content

When the conversation grows past a threshold, fold the oldest messages into a summary and prepend that summary as a new system-role message:

def manage_memory(messages, max_recent=20, max_total=40):
    if len(messages) <= max_total:
        return messages

    # Assumes system messages sit at the front of the list; "middle" is
    # everything between them and the recent window.
    system = [m for m in messages if m["role"] == "system"]
    recent = messages[-max_recent:]
    middle = messages[len(system):-max_recent]

    summary_text = summarize(middle)
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary_text}",
    }
    return system + [summary_msg] + recent

Now the model sees the original system prompt, a compact summary of older turns, and the recent turns verbatim.

Layering them

You can layer these into a hierarchy:

Layer             | What it stores                  | When it triggers
Active window     | Last N messages, verbatim       | Always
Recent summary    | Compressed previous N messages  | When active window overflows
Long-term summary | Compressed previous summaries   | When summaries themselves overflow
Vector memory     | Embeddings of facts             | Pulled in on demand (next lesson)

A practical rule of thumb: most production agents need at most two layers. Active window plus one summary is enough for sessions up to a few hours. Beyond that, you need persistent memory, which we cover in the next lesson.

When to summarize

Two reasonable triggers:

Token-based

Estimate the token count and summarize when you cross a threshold:

def should_summarize(messages, threshold=4000):
    return estimate_tokens(messages) > threshold
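The check above assumes an estimate_tokens helper. A minimal sketch, again using the rough 4-characters-per-token heuristic (an approximation; use a real tokenizer when precision matters):

```python
def estimate_tokens(messages):
    """Very rough token estimate: ~4 characters per token for English text."""
    chars = sum(len(m.get("content", "")) for m in messages)
    return chars // 4
```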

Turn-based

Summarize every K turns regardless:

def should_summarize(messages, every=10):
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return user_turns > 0 and user_turns % every == 0

Token-based is more accurate. Turn-based is easier to reason about. Pick one and be consistent.

What a good summary looks like

A bad summary loses the very facts the agent needs:

The user asked some questions about their project and the assistant helped them.

A good summary preserves entities, decisions, and pending work:

User is debugging an authentication bug in their Django app. Assistant helped trace the issue to a missing CSRF token in the login form. User confirmed the fix works locally but the staging environment still fails. Open question: whether the staging proxy is stripping CSRF cookies.

Three things make a summary good:

  1. Specific entities. Names, files, IDs. Not "their app" but "their Django app."
  2. State of the work. What's done, what's pending, what's blocked.
  3. No commentary. Don't summarize the style of the conversation, summarize the content.

The summarizer prompt should ask for these explicitly. Models will produce vague, abstracting summaries by default. You have to push them toward concrete ones.

The drift problem

Summary memory has one nasty failure mode: each summarization is lossy, so if you summarize repeatedly, information drifts. By the tenth summary of a summary, names become "the project" and decisions become "they discussed it."

Two defenses:

Re-summarize from a richer source

Keep the original messages around (in cheap storage, not in context) and re-summarize from them when you need to refresh, instead of summarizing the previous summary.

class MemoryStore:
    def __init__(self):
        self.full_log = []        # cheap storage, never sent to the model
        self.summary = None       # current summary, sent to the model

    def add(self, message):
        self.full_log.append(message)
        # Always re-summarize from the full log, never from the previous
        # summary, so compression loss doesn't compound.
        if len(self.full_log) % 20 == 0:
            self.summary = summarize(self.full_log)

Pin critical facts

For facts that must not drift (the user's name, the current goal, key decisions), keep them out of the summary entirely and store them in a structured field that's appended to every summary:

SUMMARY_PROMPT = f"""Summarize the conversation. The following facts MUST appear in your summary verbatim:
- User name: {state.user_name}
- Current goal: {state.goal}
- Key decisions: {state.decisions}
"""

This is a small step toward a more structured memory, which is exactly what the next lesson covers.

Summaries are not free

A summarization call costs tokens. A 4000-token conversation might compress to 400 tokens, but you spent the 4000 tokens to summarize it. Run the math: summarization saves money only if the conversation will continue for several more turns. For short conversations, sliding window is cheaper. For long ones, summaries pay for themselves.
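That break-even point can be sketched numerically. Assuming a uniform per-token price and the illustrative numbers above (4000-token history, 400-token summary):

```python
def break_even_turns(history_tokens=4000, summary_tokens=400):
    """Smallest number of future turns after which summarizing beats
    resending the full history. Illustrative arithmetic, uniform pricing."""
    # one-off cost: the summarization call reads the history and writes the summary
    one_off_cost = history_tokens + summary_tokens
    # each future turn, the summary saves (history - summary) prompt tokens
    saved_per_turn = history_tokens - summary_tokens
    # ceiling division: smallest n with n * saved_per_turn >= one_off_cost
    return -(-one_off_cost // saved_per_turn)
```

With these numbers the call pays for itself after two more turns, which is why summarization is almost always worth it for conversations that keep going.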

Key takeaway

Sliding window drops old turns. Summary memory compresses them. Both are strategies for fitting unbounded conversations into a fixed context window. Layered together, they handle most multi-turn agents up to a few hours of session time. For longer or cross-session memory, you need a different mechanism, and that's the next lesson: vector stores and retrieval.
