Single-agent architectures

Limits of single agents

Context pollution, tool overload, and confused reasoning.

Where the single agent runs out of room

Two lessons in, we've defended the monolith hard. That was deliberate. Most agents that get blown up into multi-agent monsters didn't need to be. But there are real ceilings that single-agent architectures hit, and recognizing them is the only honest way to know when to stop optimizing the monolith and start splitting it up.

Three failure classes show up consistently in production. Each one has a clear diagnostic, and each one points toward a specific multi-agent pattern that we'll cover later in this track.

Failure 1: tool overload

When you have more than ~15 tools defined, picking the right one becomes unreliable. The model sees a long list of tool descriptions and starts:

  • Picking similar but wrong tools. It calls search_codebase when it should have called search_documentation.
  • Skipping tools entirely. It guesses an answer rather than searching the long list.
  • Calling the wrong variant. It uses get_user_by_id when it should have used get_user_by_email.

This isn't a model intelligence problem. It's a context problem: every tool description competes for attention in the same prompt window, and beyond a certain count the signal-to-noise ratio degrades.

How to diagnose

Look at your traces. If you see the model picking near-miss tools more often as the registry grows, that's the symptom. A simple metric: track, per call, the tool the model picked versus the tool a human would have picked. As tool count goes up, that mismatch rate rises.
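
A minimal sketch of that metric, assuming traces are dicts with a hypothetical shape (num_tools on the trace, picked_tool and expected_tool on each call; adapt the field names to your own trace schema):

    from collections import defaultdict

    def wrong_tool_rate(traces: list[dict]) -> float:
        """Fraction of tool calls where the model's pick differs from the human label."""
        total = mismatches = 0
        for trace in traces:
            for call in trace["tool_calls"]:
                total += 1
                if call["picked_tool"] != call["expected_tool"]:
                    mismatches += 1
        return mismatches / total if total else 0.0

    def rate_by_tool_count(traces: list[dict]) -> dict[int, float]:
        """Bucket traces by registry size to see whether the rate rises with tool count."""
        buckets: defaultdict[int, list[dict]] = defaultdict(list)
        for trace in traces:
            buckets[trace["num_tools"]].append(trace)
        return {n: wrong_tool_rate(ts) for n, ts in sorted(buckets.items())}

If the mismatch rate climbs as num_tools grows while staying flat for small registries, you're looking at tool overload rather than a capability gap.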

The fix

Two options. The lighter-weight one is tool scoping by route, which we built in the previous lesson: each route gets a subset of the tool registry.
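
A minimal sketch of route scoping, with a hypothetical ROUTE_TOOLS registry and a stand-in classifier (in practice classify_route is the router from the previous lesson):

    # Hypothetical mapping from route to the tool subset that route is allowed to see.
    ROUTE_TOOLS = {
        "code": ["search_codebase", "read_file", "write_file"],
        "docs": ["search_documentation", "fetch_page"],
        "ops":  ["get_deploy_logs", "get_metrics", "rollback_deploy"],
    }

    def classify_route(request: str) -> str:
        """Stand-in for the previous lesson's router; yours is likely model-based."""
        if any(word in request.lower() for word in ("deploy", "metric", "rollback")):
            return "ops"
        if "doc" in request.lower():
            return "docs"
        return "code"

    def tools_for_request(request: str) -> list[str]:
        """Expose only the routed domain's tools, keeping the prompt's tool list short."""
        return ROUTE_TOOLS.get(classify_route(request), [])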

The heavier-weight one is multi-agent with specialized tool ownership: a research agent owns the search tools, a code agent owns the file tools, an ops agent owns the deployment tools. The orchestrator routes work to whichever agent has the relevant tools. We cover this in Module 3.

Failure 2: context pollution

A single agent operating across mixed domains accumulates context that's only relevant to one of those domains. After ten turns of code reading, the message list contains thousands of tokens about file contents, function names, and bug hypotheses. Then the user asks an unrelated question about deployment metrics.

The agent now reasons about deployment with all that code context still in scope. Two things go wrong:

  • The model gets distracted. It might reach for a code tool when an ops tool would have been right, because code is what's "warm" in the conversation.
  • Token cost balloons. You're paying to send all the irrelevant code context on every turn until something compacts it.

How to diagnose

Compare the model's behavior at message #1 of a fresh session vs message #20 of a long session. If quality degrades over the course of a session even when the questions stay simple, you have context pollution.
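
One way to run that comparison, sketched with a hypothetical complete(messages) function standing in for your model client:

    def answer_fresh(question: str, complete) -> str:
        """Ask in a brand-new session: only the question is in context."""
        return complete([{"role": "user", "content": question}])

    def answer_in_long_session(question: str, history: list[dict], complete) -> str:
        """Ask the same question at the tail of an existing long session."""
        return complete(history + [{"role": "user", "content": question}])

    # Score both answers with whatever eval you trust. If answer_fresh consistently
    # wins on the same simple questions, the long history is polluting the context.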

The fix

Track 1 Module 5 covered the within-session fixes: aggressive summarization, scoped histories, retrieval over recall. Those work for moderate cases.

For severe cases (mixed domains in one conversation), the right fix is to give each domain its own state. Either spawn fresh inner-loop conversations per task and only carry forward the answer, or split into separate agents with separate state. The second option is the multi-agent answer.
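
A minimal sketch of the first option, assuming a hypothetical run_agent(messages) inner loop; only the final answer crosses back into the outer conversation:

    def run_task_isolated(task: str, run_agent) -> str:
        """Run one task in a fresh inner conversation and return only its answer."""
        messages = [{"role": "user", "content": task}]
        answer = run_agent(messages)  # tool calls and scratch work stay in here
        return answer                 # the inner context is dropped on return

    def outer_loop(tasks: list[str], run_agent) -> list[dict]:
        """The outer conversation only ever accumulates task -> answer pairs."""
        return [
            {"task": task, "answer": run_task_isolated(task, run_agent)}
            for task in tasks
        ]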

Failure 3: confused reasoning across domains

This is the subtlest of the three and the most damaging. Even when tool count and context are bounded, a single agent asked to reason simultaneously about two domains often produces worse answers than two agents each focused on one.

A real example: a single agent asked "Why is the deploy failing and what's the fix?" might:

  1. Look at the deployment logs (ops domain)
  2. Read the failing code (code domain)
  3. Try to write a code fix (code domain)
  4. Try to validate against deployment requirements (ops domain)
  5. Switch back to the code (code domain)
  6. ...

Each domain switch is a context shift. The model holds two mental models at once and tends to write code that ignores deployment constraints, or write deployment plans that ignore the code reality. Quality degrades not because the model can't do either task, but because it's doing both at once in the same scratchpad.

How to diagnose

Quality drops on tasks that genuinely span two or more specialized domains, even when each domain alone is something the model handles well. You can confirm by running the task with two separate agents (manually orchestrated) and seeing whether the answer improves.

The fix

Specialization. The supervisor/worker pattern (Module 3) lets one agent decompose the task and dispatch parts to specialists. The specialists each operate in a clean context with focused tools. The supervisor synthesizes their answers.
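
A skeletal version of the pattern, with stand-in specialists and with decompose and synthesize passed in as hypothetical model-backed callables:

    def code_agent(subtask: str) -> str:
        """Stand-in: an agent holding only code tools, started with a clean context."""
        return f"[code agent result for: {subtask}]"

    def ops_agent(subtask: str) -> str:
        """Stand-in: an agent holding only ops tools, started with a clean context."""
        return f"[ops agent result for: {subtask}]"

    SPECIALISTS = {"code": code_agent, "ops": ops_agent}

    def supervisor(task: str, decompose, synthesize) -> str:
        """Decompose the task, dispatch each part to a specialist, synthesize the results."""
        subtasks = decompose(task)  # e.g. [("ops", "read the deploy logs"), ("code", "draft a fix")]
        results = [(domain, SPECIALISTS[domain](part)) for domain, part in subtasks]
        return synthesize(task, results)  # one final call merges the focused answers

Each specialist sees only its own subtask, which is exactly the clean-context, focused-tools property this fix depends on.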

This is the most honest argument for multi-agent: not "more agents are smarter" but "specialization beats generalism for tasks that genuinely span domains."

Putting it together

Tool overload
  • Symptom: wrong-tool selection rate rises with tool count
  • First line of defense: tool scoping by route
  • Last line of defense: multi-agent with split tool ownership

Context pollution
  • Symptom: quality drops over long sessions
  • First line of defense: summarize, scope, retrieve
  • Last line of defense: per-domain agents with isolated state

Confused reasoning
  • Symptom: quality drops on tasks spanning domains
  • First line of defense: better prompt structuring
  • Last line of defense: supervisor/worker with specialists

The first-line defenses are the ones we've been building toward all of Track 1 and most of this module. The last-line defenses are what the rest of Track 2 is about.

When not to multi-agent

A few honest counter-cases. If you see these, stay monolithic:

  • The "limits" you're hitting are actually a slow model. A bigger or faster model fixes the problem and multi-agent doesn't.
  • You haven't tried the within-session fixes yet (summary memory, retrieval, scoped tools). Multi-agent on top of broken context management is just broken context management with more boxes.
  • The task is genuinely single-domain. A focused 8-tool agent beats a 4-agent system at almost any pure code task.
  • Latency matters more than quality. Multi-agent always adds turns.

If none of those apply and you're still hitting one of the three failures above, splitting is the right call.

The 'just add more agents' trap

There's a recurring pattern where teams hit a quality ceiling, add a second agent, hit a different ceiling, add a third, and end up with a baroque 7-agent system that nobody fully understands. Each split should be motivated by a specific failure. Don't add agents because the architecture diagram looks impressive.

What's next

You now have a clear picture of where single-agent architectures live and where they break. The rest of Track 2 introduces the patterns that take over when single-agent isn't enough:

  • Module 2: how multiple agents communicate (message passing, conversation protocols, handoffs)
  • Module 3: orchestration topologies (sequential, supervisor/worker, hierarchical, swarm)
  • Module 4: state management across agents (shared vs isolated, state machines, checkpointing)
  • Module 5: safety controls (tool whitelisting, permission scopes, approval gates)
  • Module 6: metacognition (self-reflection, strategy adaptation, knowing when to escalate)

Each one earns its complexity by addressing one of the failure modes you saw in this lesson. Pay attention to which failure each pattern fixes, not just to how it works mechanically.

Key takeaway

Single-agent architectures fail in three predictable ways: tool overload, context pollution, and confused reasoning across domains. The within-session fixes carry you a long way. When they're not enough, the failure mode itself tells you which multi-agent pattern to reach for. The next module starts that journey by establishing how agents talk to each other in the first place.
