Deployment and scaling
Cost management and rate limiting
Keeping production agent costs under control.
Knowing where the dollars go
Agent systems can rack up surprising costs fast. A poorly tuned prompt change can double per-request cost overnight. A runaway loop can burn a thousand dollars before anyone notices. A user who triggers expensive queries can run up an unsustainable bill within hours.
This lesson is about controlling cost at multiple levels: per call, per session, per user, per day. The patterns are familiar from any infrastructure cost story; what's specific is which knobs matter for agents.
What costs money
Three buckets:
LLM tokens
Usually the largest line item. Cost is a function of input tokens, output tokens, and the model used. Long contexts (lots of tool results, lots of prior turns) and verbose outputs both burn budget.
MCP / tool / vendor costs
External APIs that the agent calls. Some are free (your own filesystem); some are pay-per-call (vendor APIs); some are pay-per-use (databases, search engines). Each has its own pricing model.
Infrastructure
The agent service, MCP servers, observability backend, queue, store. Usually small relative to LLM costs but worth tracking, especially as you scale.
For most agent products in 2026, LLM costs are 60-90% of total spend. Optimize there first.
Per-call levers
The smallest unit of cost is one LLM call. Levers:
Use a cheaper model when possible
A small model often works for routing, classification, or simple summarization. Reserve expensive models for the actual reasoning. The "chained models" pattern (cheap classifier picks a route, expensive model does the work) often cuts cost by half without quality loss.
async def handle(request):
    route = await cheap_model.classify(request)    # cheap call
    if route in QUICK_ROUTES:
        return await cheap_model.answer(request)   # cheap call
    return await big_model.answer(request)         # expensive only when needed
Trim the prompt
Every token in the input costs money on every turn. System prompts, tool descriptions, and conversation history all accumulate. Audit them; remove what isn't earning its keep.
Use prompt caching
For prompts that repeat (large system prompts, large tool descriptions), prompt caching cuts the cost of repeated input dramatically. Most providers support it; opt in.
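As a concrete example, here is how the opt-in looks with the Anthropic SDK's cache_control field (the model ID and prompt are placeholders; other providers expose similar switches):
import anthropic

client = anthropic.Anthropic()      # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."          # the large, stable prefix of your prompt

response = client.messages.create(
    model="claude-sonnet-4-5",      # placeholder model ID
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable: repeated calls sharing this
            # exact prefix are billed at the much cheaper cache-read rate
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's tickets."}],
)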
Bound the output
The max_tokens parameter caps response length. The agent doesn't need to write a 2000-token answer if 200 will do. Set sensible defaults; raise them only for tasks that need it.
Use structured outputs
Forcing structured output (JSON, function-call format) reduces the model's tendency to ramble. Outputs are smaller and more useful.
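A sketch combining both levers, using OpenAI-style structured outputs (the model name and schema are illustrative; adapt the parameter names to your provider):
from openai import AsyncOpenAI

client = AsyncOpenAI()      # reads OPENAI_API_KEY from the environment

async def classify_route(request_text: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",        # placeholder: any cheap model works here
        max_tokens=200,             # cap output: a routing decision never needs 2000 tokens
        response_format={           # force JSON so the model can't ramble
            "type": "json_schema",
            "json_schema": {
                "name": "route_decision",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"route": {"type": "string"}},
                    "required": ["route"],
                    "additionalProperties": False,
                },
            },
        },
        messages=[{"role": "user", "content": request_text}],
    )
    return resp.choices[0].message.content   # short JSON, e.g. {"route": "billing"}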
Per-session levers
A session is one user request through the agent. Levers:
Cap turns
A session that runs 30 turns instead of 5 costs roughly 6x as much. Set max-turn limits in your loop (Track 2 Module 4 lesson 2 covered the state machine; iteration caps belong there). Don't let an agent loop indefinitely.
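A minimal sketch of the cap, assuming hypothetical init_state, run_turn, and fallback_answer helpers:
MAX_TURNS = 8   # illustrative cap; tune per task type

async def run_session(task):
    state = init_state(task)                # hypothetical session setup
    for _ in range(MAX_TURNS):
        result = await run_turn(state)      # hypothetical: one model+tool turn
        if result.done:
            return result.answer
    # Cap hit: return a graceful failure instead of burning tokens forever
    return fallback_answer(state, reason="max_turns_exceeded")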
Skip turns when possible
If a turn isn't going to add value (the model is just summarizing what we already know), skip it. Reflection passes (Track 2 Module 6) can be conditional: only run them when the agent's first answer was uncertain.
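A sketch of conditional reflection, assuming a hypothetical agent interface that reports confidence on its draft answers:
CONFIDENCE_THRESHOLD = 0.7   # illustrative; calibrate against your evals

async def answer_with_optional_reflection(request):
    draft = await agent.answer(request)     # hypothetical agent interface
    if draft.confidence >= CONFIDENCE_THRESHOLD:
        return draft                        # skip the reflection pass entirely
    # Pay for reflection only when the first answer was shaky
    return await agent.reflect_and_revise(request, draft)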
Compress history
Long conversations accumulate context. Periodically summarize old turns into shorter forms. Track 1 Module 5 covered this for context window management; it's also a cost lever.
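A sketch, assuming hypothetical count_tokens and cheap_model.summarize helpers:
HISTORY_TOKEN_BUDGET = 8_000   # illustrative threshold
KEEP_RECENT = 6                # last N turns stay verbatim

async def maybe_compress(history: list[dict]) -> list[dict]:
    if count_tokens(history) < HISTORY_TOKEN_BUDGET:   # hypothetical token counter
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = await cheap_model.summarize(old)   # one cheap call replaces many turns
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent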
Don't redo work
If the agent has already retrieved or computed something this session, don't ask the model to do it again. Cache results; reference them by ID.
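A minimal per-session cache, keyed on the tool call's name and arguments (the tool interface is hypothetical):
import hashlib
import json

session_cache: dict[str, object] = {}   # per-session; discard when the session ends

async def cached_tool_call(tool, args: dict):
    # Key on the tool name plus its arguments, canonically serialized
    key = hashlib.sha256(
        json.dumps([tool.name, args], sort_keys=True).encode()
    ).hexdigest()
    if key in session_cache:
        return session_cache[key]        # reuse: zero tokens, zero vendor cost
    result = await tool.run(args)        # hypothetical tool interface
    session_cache[key] = result
    return result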
Per-user levers
User-level cost matters for product viability. If your average user costs you $0.03 and you charge $10/month, fine. If your average is $3 and one user costs $300, you have a problem.
Per-user budgets
Hard caps on tokens or calls per user per day. When exceeded, the user gets a clear message: "you've reached your daily limit, resets at X." The pattern is the same as API rate limits.
async def execute(user, call):
    used = await usage_store.get(user, period="day")
    if used.tokens > user.daily_token_limit:
        return {"status": "denied", "reason": "daily token limit exceeded"}
    result = await actually_call(call)
    await usage_store.add(user, tokens=result.tokens, period="day")
    return result
Tiered limits
Free tier gets less; paid tiers get more. The runtime policy from Module 1 lesson 4 is the right place for these limits to live.
Spike detection
A user who normally makes 10 calls a day suddenly making 1000 in an hour is suspicious: either a runaway loop on their side or abuse. Throttle automatically; alert humans to investigate.
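A sketch of a sliding-window spike check (the threshold values are illustrative):
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
SPIKE_MULTIPLIER = 20    # flag at 20x the user's normal hourly rate

recent_calls: defaultdict[str, deque] = defaultdict(deque)

def record_and_check(user_id: str, baseline_per_hour: float) -> bool:
    """Return True if the user's last-hour call rate looks like a spike."""
    now = time.time()
    calls = recent_calls[user_id]
    calls.append(now)
    while calls and calls[0] < now - WINDOW_SECONDS:   # drop calls outside the window
        calls.popleft()
    return len(calls) > max(10, baseline_per_hour * SPIKE_MULTIPLIER)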
Per-day levers
Aggregate spend has its own concerns:
Daily / monthly budgets with kill switches
A circuit breaker on total spend: if today's spend exceeds X, throttle non-essential traffic. Better to slow the system than to wake up to a six-figure bill.
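A sketch of the admission check, assuming a hypothetical spend_store and request flags:
DAILY_BUDGET_USD = 500.0   # illustrative budget
THROTTLE_AT = 0.8          # start shedding non-essential traffic at 80%

async def admit(request) -> bool:
    spent = await spend_store.today_total()   # hypothetical spend store
    if spent >= DAILY_BUDGET_USD:
        return request.essential              # hard stop for everything else
    if spent >= DAILY_BUDGET_USD * THROTTLE_AT:
        return request.essential or request.priority == "high"
    return True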
Cost forecasts
Track daily spend; project to month-end. If the trajectory is too high, take action before you blow the budget.
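The projection can be as naive as a linear extrapolation:
import calendar
from datetime import date

def projected_month_end(spend_to_date: float) -> float:
    """Naive linear projection of month-to-date spend; good enough for alerting."""
    today = date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date * days_in_month / today.day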
Per-feature attribution
Some agent features are more expensive than others. Tag costs by feature so you can decide whether each one earns its keep.
Cost telemetry
Cost should be a first-class metric:
# at the end of each turn
log.info("agent_turn_completed", extra={
    "session_id": sid,
    "user_id": uid,
    "model": model_name,
    "tokens_in": resp.input_tokens,
    "tokens_out": resp.output_tokens,
    "estimated_cost_usd": estimate(model_name, resp),
})
Aggregate to dashboards: cost per session, cost per user, cost per task type. Alert on cost spikes. Tie eval pipelines to cost so a "quality improvement" that doubled cost shows up clearly.
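The estimate helper in the snippet above can be a static pricing table (the prices below are placeholders, not real rates):
# Placeholder prices per million tokens; substitute your provider's real rates
PRICING = {
    "small-model": {"in": 0.25, "out": 1.25},
    "big-model": {"in": 3.00, "out": 15.00},
}

def estimate(model_name: str, resp) -> float:
    p = PRICING[model_name]
    return (resp.input_tokens * p["in"] + resp.output_tokens * p["out"]) / 1_000_000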
Cost vs quality tradeoffs
Most cost cuts have quality implications. The framing should be: "for this workload, what's the minimum-cost configuration that still hits our quality target?"
Concrete example: a customer-support agent might run a cheaper model with fewer turns most of the time, escalating to a stronger model only when the cheap one's confidence is low (Track 2 Module 6 lesson 3). The expensive path is rare; total cost drops; quality on the easy cases stays the same; quality on hard cases (handled by the expensive model) might even go up because that model isn't distracted by easy cases.
Common cost surprises
Massive prompts that don't need to be massive
A system prompt that grew from 800 to 3000 tokens because someone added a long example. Multiply by every turn, every session, every user. Expensive.
Tool results dumped wholesale
A tool that returns 50KB of JSON gets stuffed into the model's next prompt every turn until the agent moves on. Truncate tool outputs at the executor; surface a short summary plus an artifact pointer.
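A sketch of executor-side truncation, assuming a hypothetical artifact store:
MAX_TOOL_OUTPUT_CHARS = 2_000   # illustrative cap

def truncate_tool_output(raw: str, artifact_store) -> str:
    if len(raw) <= MAX_TOOL_OUTPUT_CHARS:
        return raw
    artifact_id = artifact_store.put(raw)   # hypothetical: full payload kept out-of-band
    return (
        raw[:MAX_TOOL_OUTPUT_CHARS]
        + f"\n[truncated; full output stored as artifact {artifact_id}]"
    )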
Reflection on every turn
Reflection is a cost multiplier. Doing it on every turn instead of conditionally is often 2x cost for marginal quality.
Long sessions on cheap users
A free-tier user starts a conversation that runs for an hour. Hard caps on session length avoid this.
Models that got more expensive
Provider pricing changes. Cost per request can creep up because the same model now costs 20% more. Watch your cost-per-token, not just total cost.
When cost matters less
Some agent products genuinely don't need aggressive cost control:
- High-value B2B tools where each successful run produces real economic value.
- Internal employee tools where the cost is small relative to time saved.
- Heavy R&D / batch workloads where the budget is fixed and predictable.
Don't over-engineer cost controls for these. The mental overhead has its own cost. Optimize when cost is meaningful relative to value, not by default.
Per-user attribution is the most important cost metric
A single useful number to track from day one: cost per user, segmented by tier. It tells you whether your business model works (or doesn't), whether a small group of users is hogging resources, and whether features are profitable. If you can't compute "what's the average dollar cost of a free-tier user," you can't make pricing or product decisions on solid ground.
Key takeaway
Cost lives at four levels: per call, per session, per user, per day. Use cheaper models for cheap work, trim prompts, cap turns, compress history, set per-user budgets, alert on spikes, and have kill switches for daily totals. Track cost as a first-class metric alongside latency and quality. Cost cuts have quality implications; frame the optimization as "minimum cost that hits the quality target." The next and final lesson covers the human-in-the-loop patterns that compose with everything in this track.