Memory and context engineering
Context window budgeting
What goes in, what gets summarized, what gets dropped.
The budget you don't think about
You have a context window. Let's say it's 8K tokens. Every turn, you fill that window with something. The question is: with what, exactly?
Most engineers don't think about this until something breaks. Then they realize they've been spending 6K tokens of their 8K budget on conversation history they don't need, leaving the model 2K tokens to actually reason. The model gives a worse answer, they blame the model.
Context budgeting is the practice of treating your context window like a finite resource and deciding deliberately what gets to live in it.
What competes for the window
Every message you send to the model fights for the same fixed budget. The competitors:
| Category | Typical share | Notes |
|---|---|---|
| System prompt | 200 to 1000 tokens | Often bloated; trim ruthlessly |
| Tool definitions | 500 to 3000 tokens | Each tool's schema and description |
| Long-term memory | 200 to 2000 tokens | Retrieved facts injected per turn |
| Conversation history | Varies | Grows unbounded without management |
| Current user message | 50 to 500 tokens | The actual question |
| Tool observations | 500 to 4000 tokens each | Search results, file contents, API responses |
| Reserved output | 500 to 2000 tokens | Has to fit too, model can't write outside the window |
Add these up for a real session and you'll often be over the limit before the model writes a single token. Take the midpoints of the ranges above and you're already around 7K tokens before counting any conversation history; on an 8K window, that leaves well under 1K for the entire history.
A budgeting exercise
Take an existing agent of yours and run this calculation on a typical turn:
```python
def measure_budget(messages, model_window):
    """Tally token usage per role and the headroom left in the window."""
    counts = {}
    for m in messages:
        role = m.get("role", "unknown")
        counts[role] = counts.get(role, 0) + estimate_tokens(m["content"])
    counts["total"] = sum(counts.values())
    counts["available"] = model_window - counts["total"]
    return counts
```

Run it after a representative conversation. You'll usually find one of three patterns:
- Bloated system prompt. A 2K-token system prompt with examples that the model has long since memorized.
- Sprawling history. Every tool observation from the last 20 turns still sitting in messages.
- Repeating the same fact. The same retrieved context appearing in every turn because nothing dedupes.
Each one has a fix.
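A note on the helper: measure_budget assumes an estimate_tokens function. If your SDK doesn't expose a real token counter, a character-based approximation is enough for budgeting. A rough sketch, assuming about four characters per token (an approximation, not an exact count):

```python
import json

def estimate_tokens(content) -> int:
    """Very rough token estimate: ~4 characters per token (assumption, not exact)."""
    if not isinstance(content, str):
        content = json.dumps(content, default=str)  # structured content, tool results
    return max(1, len(content) // 4)
```

Swap in your provider's tokenizer if one is available; for budgeting decisions, rough is fine.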
The four levers
Once you see the budget, you have four levers to pull:
1. Trim what's static
The system prompt is sent on every turn, so 500 wasted tokens there cost you 500 tokens on every single call, multiplied across the whole session.
Remove example outputs the model can already produce, redundant role-play instructions, and lengthy formatting rules that could be a one-line constraint instead.
```python
# Before: 800 tokens
SYSTEM = """You are a helpful, friendly, professional assistant who specializes in
answering questions about software engineering. You should always be polite,
thorough, and accurate. Here are some examples:
[300 tokens of examples]
Always respond in markdown. Always cite sources. Never make up facts. ...
"""

# After: 80 tokens
SYSTEM = """Software engineering assistant. Cite sources. Respond in markdown.
Refuse to answer if uncertain."""
```

2. Truncate what's dynamic
Tool observations are the worst offender. A search that returns 5KB of HTML, a file read that returns a 2000-line file, a database query that returns 100 rows.
Truncate at the tool boundary, not in the loop:
```python
import json

def call_tool(call, registry):
    # Dispatch to the registered tool (arguments assumed already parsed into a dict)
    result = registry[call.function.name](**call.function.arguments)
    text = json.dumps(result, default=str) if not isinstance(result, str) else result
    # Truncate at the tool boundary so oversized results never reach the message list
    if len(text) > 2000:
        text = text[:2000] + f"\n... [truncated, {len(text) - 2000} chars]"
    return text
```

If the model needs more, it can call the tool again with a refined query. Better to cycle than to dump.
3. Summarize what's history
We covered this in lesson 2. As the conversation grows, fold older turns into a summary. The fold should be aggressive: a 4000-token chunk of conversation often compresses to 200 tokens without meaningful loss.
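A minimal sketch of the fold, assuming a hypothetical llm_complete helper that sends a single prompt to your model and returns text:

```python
def fold_history(messages, keep_last=4):
    """Replace older turns with a short summary; keep the most recent turns verbatim."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if not old:
        return messages
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    # llm_complete: hypothetical single-prompt completion helper
    summary = llm_complete(
        "Summarize this conversation in under 200 tokens, keeping decisions, "
        "facts, and open questions:\n\n" + transcript
    )
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```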
4. Retrieve only what's relevant
We covered this in lesson 3. Don't dump all stored memory into every turn. Embed the current question and pull only the K most relevant facts.
The trap: K too large defeats the purpose, K too small misses things. Tune K against your dataset; common starting values are K=3 to K=5.
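A minimal sketch of the retrieval step, assuming a hypothetical embed() helper that turns a string into a vector and a store of (fact, embedding) pairs built up in earlier turns:

```python
import numpy as np

def top_k_facts(question, store, k=3):
    """Return the k stored facts most similar to the current question."""
    q = np.array(embed(question))  # embed: hypothetical embedding helper
    scored = []
    for fact, vec in store:
        v = np.array(vec)
        score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))  # cosine similarity
        scored.append((score, fact))
    return [fact for _, fact in sorted(scored, reverse=True)[:k]]
```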
The priority order
When the budget is tight, cut in this order:
- Drop: older conversation turns (replace with summary)
- Drop: tool observations from previous turns once the answer is delivered
- Drop: retrieved facts that didn't make it into the answer
- Truncate: tool observations in the current turn
- Trim: the system prompt
- Last resort: drop tool definitions you don't think the model needs
Tool definitions are last because dropping them changes what the model can do. Everything else is reducing noise; this is reducing capability.
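None of these steps needs heavy machinery. The second item, dropping prior-turn observations once the answer is delivered, can be a single pass that stubs out old tool messages. A minimal sketch, assuming OpenAI-style roles where tool results carry the role "tool":

```python
def drop_old_observations(messages):
    """Stub out tool observations from turns before the current user message."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=len(messages),
    )
    return [
        {**m, "content": "[observation dropped after use]"}
        if m["role"] == "tool" and i < last_user else m
        for i, m in enumerate(messages)
    ]
```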
Compaction events
A useful pattern is the compaction event: a checkpoint where you actively shrink the message list before continuing.
```python
def compact_if_needed(messages, model_window, target=0.5):
    """When messages exceed target * window, compact aggressively."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    if used < target * model_window:
        return messages
    # Pull out the static parts (assumes system messages sit at the front)
    system = [m for m in messages if m["role"] == "system"]
    # Find the latest finalized turn boundary
    last_user_idx = max(i for i, m in enumerate(messages) if m["role"] == "user")
    pre_last = messages[len(system):last_user_idx]
    current = messages[last_user_idx:]
    # Summarize everything before the current turn
    summary = summarize(pre_last)
    return system + [{"role": "system", "content": f"Earlier: {summary}"}] + current
```

Trigger this on token thresholds, every K turns, or after every finalized task. The exact trigger matters less than having one at all.
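Wiring it into the loop is one line per turn. A sketch, assuming a hypothetical chat_completion helper that sends the message list to your model and returns the assistant's text:

```python
def run_turn(messages, user_input, model_window=8000):
    messages.append({"role": "user", "content": user_input})
    # Compact before every model call; cheap to check, expensive to skip
    messages = compact_if_needed(messages, model_window)
    reply = chat_completion(messages)  # chat_completion: hypothetical model call
    messages.append({"role": "assistant", "content": reply})
    return messages, reply
```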
Budget-aware tools
A subtle point: your tools should be designed knowing the budget exists. A tool that returns "the entire file" is a budget hazard. A tool that returns "lines 50 to 100 of the file" is a good budget citizen.
```python
# Bad: dumps the world
def read_file(path: str) -> str:
    return open(path).read()

# Good: paginated
def read_file(path: str, start: int = 0, length: int = 100) -> dict:
    with open(path) as f:
        lines = f.readlines()
    chunk = lines[start:start + length]
    return {
        "lines": chunk,
        "start": start,
        "total": len(lines),
        "more": start + length < len(lines),
    }
```

The agent can read the next page if it needs to. If it doesn't, the budget is preserved.
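The pagination only helps if the model knows about it, so surface it in the tool definition. A sketch, assuming OpenAI-style function schemas (the description wording is illustrative):

```python
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file 100 lines at a time. Call again with a higher "
                       "'start' to page through large files.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "start": {"type": "integer", "description": "First line to return", "default": 0},
                "length": {"type": "integer", "description": "Number of lines", "default": 100},
            },
            "required": ["path"],
        },
    },
}
```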
Bigger windows are not the answer
You might be thinking: just use a 200K-token model and stop worrying. Two problems. First, the per-token cost still applies: you pay for every token you send, so a 50-turn conversation that sprawls to fill a 200K window costs roughly 10x what the same conversation managed down to 20K does. Second, more context tends to reduce answer quality: models attend less reliably to material buried in the middle of a long prompt (the lost-in-the-middle effect). The most reliable agents fit comfortably under their model's limit, not at the edge of it.
Key takeaway
Context budgeting is the meta-skill that ties together everything in this module. Every memory strategy (windowing, summarizing, retrieving) is a tactic for staying inside the budget. The four levers are: trim the static, truncate the dynamic, summarize the historical, and retrieve only the relevant. Sit down with one of your agents, measure where the tokens go, and pull the right lever. It's the highest-leverage performance work you can do without changing models.
The next module is the last in Track 1, and it's about applying these same retrieval ideas to a specific use case: agentic RAG, where the agent itself decides when and what to retrieve.