Memory and context engineering
Context window budgeting
What goes in, what gets summarized, what gets dropped.
The budget you don't think about
You have a context window. Let's say it's 8K tokens. Every turn, you fill that window with something. The question is: with what, exactly?
Most engineers don't think about this until something breaks. Then they realize they've been spending 6K tokens of their 8K budget on conversation history they don't need, leaving the model 2K tokens to actually reason. The model gives a worse answer, they blame the model.
Context budgeting is the practice of treating your context window like a finite resource and deciding deliberately what gets to live in it.
What competes for the window
Every message you send to the model fights for the same fixed budget. The competitors:
| Category | Typical share | Notes |
|---|---|---|
| System prompt | 200 to 1000 tokens | Often bloated; trim ruthlessly |
| Tool definitions | 500 to 3000 tokens | Each tool's schema and description |
| Long-term memory | 200 to 2000 tokens | Retrieved facts injected per turn |
| Conversation history | Varies | Grows unbounded without management |
| Current user message | 50 to 500 tokens | The actual question |
| Tool observations | 500 to 4000 tokens each | Search results, file contents, API responses |
| Reserved output | 500 to 2000 tokens | Has to fit too, model can't write outside the window |
Add these up for a real session and you'll often be over the limit before the model writes a single token. Take the midpoints of the ranges above and you're already around 7K tokens before counting any conversation history; on an 8K window, that leaves well under 1K for the entire history.
A budgeting exercise
Take an existing agent of yours and run this calculation on a typical turn:
```python
def measure_budget(messages, model_window):
    """Tally token usage per role and the headroom left in the window."""
    counts = {}
    for m in messages:
        role = m.get("role", "unknown")
        counts[role] = counts.get(role, 0) + estimate_tokens(m["content"])
    counts["total"] = sum(counts.values())
    counts["available"] = model_window - counts["total"]
    return counts
```

Run it after a representative conversation. You'll usually find one of three patterns:
- Bloated system prompt. A 2K-token system prompt with examples that the model has long since memorized.
- Sprawling history. Every tool observation from the last 20 turns still sitting in messages.
- Repeating the same fact. The same retrieved context appearing in every turn because nothing dedupes.
Each one has a fix.
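A note on the helper: measure_budget assumes an estimate_tokens function. If your SDK doesn't expose a real token counter, a character-based approximation is enough for budgeting. A rough sketch, assuming about four characters per token (an approximation, not an exact count):

```python
import json

def estimate_tokens(content) -> int:
    """Very rough token estimate: ~4 characters per token (assumption, not exact)."""
    if not isinstance(content, str):
        content = json.dumps(content, default=str)  # structured content, tool results
    return max(1, len(content) // 4)
```

Swap in your provider's tokenizer if one is available; for budgeting decisions, rough is fine.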
The four levers
Once you see the budget, you have four levers to pull:
1. Trim what's static
The system prompt is sent on every turn, so 500 wasted tokens there cost you 500 tokens on every single call, multiplied across the whole session.
Remove example outputs the model can already produce, redundant role-play instructions, and lengthy formatting rules that could be a one-line constraint instead.
```python
# Before: 800 tokens
SYSTEM = """You are a helpful, friendly, professional assistant who specializes in
answering questions about software engineering. You should always be polite,
thorough, and accurate. Here are some examples:
[300 tokens of examples]
Always respond in markdown. Always cite sources. Never make up facts. ...
"""

# After: 80 tokens
SYSTEM = """Software engineering assistant. Cite sources. Respond in markdown.
Refuse to answer if uncertain."""
```

2. Truncate what's dynamic
Tool observations are the worst offender. A search that returns 5KB of HTML, a file read that returns a 2000-line file, a database query that returns 100 rows.
Truncate at the tool boundary, not in the loop:
```python
import json

def call_tool(call, registry):
    # Dispatch to the registered tool (arguments assumed already parsed into a dict)
    result = registry[call.function.name](**call.function.arguments)
    text = json.dumps(result, default=str) if not isinstance(result, str) else result
    # Truncate at the tool boundary so oversized results never reach the message list
    if len(text) > 2000:
        text = text[:2000] + f"\n... [truncated, {len(text) - 2000} chars]"
    return text
```

If the model needs more, it can call the tool again with a refined query. Better to cycle than to dump.
3. Summarize what's history
We covered this in lesson 2. As the conversation grows, fold older turns into a summary. The fold should be aggressive: a 4000-token chunk of conversation often compresses to 200 tokens without meaningful loss.
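A minimal sketch of the fold, assuming a hypothetical llm_complete helper that sends a single prompt to your model and returns text:

```python
def fold_history(messages, keep_last=4):
    """Replace older turns with a short summary; keep the most recent turns verbatim."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if not old:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    # llm_complete: hypothetical single-prompt completion helper
    summary = llm_complete(
        "Summarize this conversation in under 200 tokens, keeping decisions, "
        "facts, and open questions:\n\n" + transcript
    )
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```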
4. Retrieve only what's relevant
We covered this in lesson 3. Don't dump all stored memory into every turn. Embed the current question and pull only the K most relevant facts.
The trap: K too large defeats the purpose, K too small misses things. Tune K against your dataset; common starting values are K=3 to K=5.
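A minimal sketch of the retrieval step, assuming a hypothetical embed() helper that turns a string into a vector and a store of (fact, embedding) pairs built up in earlier turns:

```python
import numpy as np

def top_k_facts(question, store, k=3):
    """Return the k stored facts most similar to the current question."""
    q = np.array(embed(question))  # embed: hypothetical embedding helper
    scored = []
    for fact, vec in store:
        v = np.array(vec)
        score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))  # cosine similarity
        scored.append((score, fact))
    return [fact for _, fact in sorted(scored, reverse=True)[:k]]
```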
The priority order
When the budget is tight, cut in this order:
- Drop: older conversation turns (replace with summary)
- Drop: tool observations from previous turns once the answer is delivered
- Drop: retrieved facts that didn't make it into the answer
- Truncate: tool observations in the current turn
- Trim: the system prompt
- Last resort: drop tool definitions you don't think the model needs
Tool definitions are last because dropping them changes what the model can do. Everything else is reducing noise; this is reducing capability.
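None of these steps needs heavy machinery. The second item, dropping prior-turn observations once the answer is delivered, can be a single pass that stubs out old tool messages. A minimal sketch, assuming OpenAI-style roles where tool results carry the role "tool":

```python
def drop_old_observations(messages):
    """Stub out tool observations from turns before the current user message."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=len(messages),
    )
    return [
        {**m, "content": "[observation dropped after use]"}
        if m["role"] == "tool" and i < last_user else m
        for i, m in enumerate(messages)
    ]
```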
Compaction events
A useful pattern is the compaction event: a checkpoint where you actively shrink the message list before continuing.
```python
def compact_if_needed(messages, model_window, target=0.5):
    """When messages exceed target * window, compact aggressively."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    if used < target * model_window:
        return messages
    # Pull out the static parts (assumes system messages sit at the front)
    system = [m for m in messages if m["role"] == "system"]
    # Find the latest finalized turn boundary
    last_user_idx = max(i for i, m in enumerate(messages) if m["role"] == "user")
    pre_last = messages[len(system):last_user_idx]
    current = messages[last_user_idx:]
    # Summarize everything before the current turn
    summary = summarize(pre_last)
    return system + [{"role": "system", "content": f"Earlier: {summary}"}] + current
```

Trigger this on token thresholds, every K turns, or after every finalized task. The exact trigger matters less than having one at all.
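Wiring it into the loop is one line per turn. A sketch, assuming a hypothetical chat_completion helper that sends the message list to your model and returns the assistant's text:

```python
def run_turn(messages, user_input, model_window=8000):
    messages.append({"role": "user", "content": user_input})
    # Compact before every model call; cheap to check, expensive to skip
    messages = compact_if_needed(messages, model_window)
    reply = chat_completion(messages)  # chat_completion: hypothetical model call
    messages.append({"role": "assistant", "content": reply})
    return messages, reply
```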
Budget-aware tools
A subtle point: your tools should be designed knowing the budget exists. A tool that returns "the entire file" is a budget hazard. A tool that returns "lines 50 to 100 of the file" is a good budget citizen.
```python
# Bad: dumps the world
def read_file(path: str) -> str:
    return open(path).read()

# Good: paginated
def read_file(path: str, start: int = 0, length: int = 100) -> dict:
    with open(path) as f:
        lines = f.readlines()
    chunk = lines[start:start + length]
    return {
        "lines": chunk,
        "start": start,
        "total": len(lines),
        "more": start + length < len(lines),
    }
```

The agent can read the next page if it needs to. If it doesn't, the budget is preserved.
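The pagination only helps if the model knows about it, so surface it in the tool definition. A sketch, assuming OpenAI-style function schemas (the description wording is illustrative):

```python
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file 100 lines at a time. Call again with a higher "
                       "'start' to page through large files.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "start": {"type": "integer", "description": "First line to return", "default": 0},
                "length": {"type": "integer", "description": "Number of lines", "default": 100},
            },
            "required": ["path"],
        },
    },
}
```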
Bigger windows are not the answer
You might be thinking: just use a 200K-token model and stop worrying. Two problems. First, the per-token cost still applies: you pay for every token you send, so a 50-turn conversation that sprawls to fill a 200K window costs roughly 10x what the same conversation managed down to 20K does. Second, more context tends to reduce answer quality: models attend less reliably to material buried in the middle of a long prompt (the lost-in-the-middle effect). The most reliable agents fit comfortably under their model's limit, not at the edge of it.
Key takeaway
Context budgeting is the meta-skill that ties together everything in this module. Every memory strategy (windowing, summarizing, retrieving) is a tactic for staying inside the budget. The four levers are: trim the static, truncate the dynamic, summarize the historical, and retrieve only the relevant. Sit down with one of your agents, measure where the tokens go, and pull the right lever. It's the highest-leverage performance work you can do without changing models.
The next module is the last in Track 1, and it's about applying these same retrieval ideas to a specific use case: agentic RAG, where the agent itself decides when and what to retrieve.