MCP in production
The orchestrator pattern for MCP
Routing queries across multiple MCP servers.
Routing across many servers
Once you have several MCP servers in production, you need a deliberate strategy for which server handles which kind of work. This is the "MCP orchestrator" pattern: a layer above raw multi-server access that decides where to route, when to fan out, and how to handle the case where two servers could plausibly handle the same query.
The orchestrator pattern reuses everything from Track 2 Module 3 (supervisor/worker, hierarchical, etc.), but the workers are now MCP servers instead of Python functions. The reasoning is the same; the runtime model is wider.
Why you need an orchestrator
Three problems naive multi-server access doesn't solve:
Tool name collisions and ambiguity
Two servers expose search tools. The agent has filesystem.search and github.search. The agent might call the wrong one or pick neither because they look interchangeable. An orchestrator that knows which server is the right one for a given intent removes the ambiguity.
Cross-server queries
The user asks "show me PRs that modified auth.py." That's GitHub for the PR list and filesystem for the file context. No single server can answer; the agent has to combine. An orchestrator can decompose the query and dispatch to each server, then synthesize.
Cost and latency control
Some servers are slow or expensive. The orchestrator can prefer cheaper/faster servers when both could answer, falling back to expensive ones only when needed.
A simple orchestrator pattern
```python
import asyncio

class MCPOrchestrator:
    def __init__(self, servers, intent_router):
        self.servers = servers       # {name: client}
        self.router = intent_router  # callable: query -> list of (server, intent)

    async def handle(self, query):
        plan = self.router(query)    # decompose into per-server intents
        results = await asyncio.gather(*(
            self.servers[s].call_tool_for_intent(intent)
            for s, intent in plan
        ))
        return self.synthesize(query, results)
```

The intent_router is the brain: given a user query, it decides which servers to involve and what to ask each one. For nontrivial cases, this is itself an LLM call.
The router prompt
```
Given a user query and a list of available MCP servers (each with its
description), decide which servers to involve and what specifically to ask
each one. Respond with a JSON list of {server, intent}.

Servers:
- filesystem: read and search local files in /workspace
- github: list, read, and create pull requests on company repos
- sql: read-only queries against the analytics warehouse
- slack: post and read messages in company channels

Query: {user_query}
```

The model returns a structured plan: `[{"server": "github", "intent": "list PRs that modify auth.py"}, {"server": "filesystem", "intent": "show context around line numbers in auth.py"}]`. The orchestrator dispatches.
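Concretely, the router can be a thin wrapper that formats a prompt like this, calls the model, and parses the returned JSON into a plan. A minimal sketch, assuming a hypothetical `llm_complete(prompt) -> str` interface; the stub below stands in for a real model call:

```python
import json

def make_intent_router(llm_complete):
    """Turn an LLM completion function (assumed signature: prompt -> str)
    into a router that returns a list of (server, intent) pairs."""
    def route(query):
        prompt = (
            "Given a user query and the available MCP servers, respond with a "
            'JSON list of {"server": ..., "intent": ...} objects.\n'
            f"Query: {query}"
        )
        plan = json.loads(llm_complete(prompt))  # validate before trusting
        return [(step["server"], step["intent"]) for step in plan]
    return route

# Stub model for illustration; a real router calls your LLM API here.
def fake_llm(prompt):
    return '[{"server": "github", "intent": "list PRs that modify auth.py"}]'

router = make_intent_router(fake_llm)
plan = router("show me PRs that modified auth.py")
```

In production you would also validate that each `server` name actually exists in the registry before dispatching.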
This is the supervisor-worker pattern from Track 2 Module 3 lesson 2, with MCP servers as the workers. Same shape, larger ecosystem.
Decomposition strategies
Single-server queries
Most queries map to one server. The router picks it; the orchestrator does one round-trip; done.
Sequential decomposition
Query needs server A's output to inform server B's call. The router emits a plan with explicit ordering: "first list PRs, then for each PR look up the linked Linear ticket." Run sequentially.
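A sequential plan can be executed with a small loop that threads each result into the next step's intent. A sketch under assumed shapes; the `{prev}` placeholder convention and the `dispatch(server, intent)` signature are illustrative, not an MCP API:

```python
import asyncio

async def run_sequential(dispatch, steps):
    """Run dependent steps in order. `dispatch` is an assumed coroutine;
    an intent may contain a {prev} placeholder that is filled with the
    previous step's result."""
    prev, results = None, []
    for server, intent in steps:
        if prev is not None:
            intent = intent.format(prev=prev)
        prev = await dispatch(server, intent)
        results.append(prev)
    return results

# Stub dispatch for illustration:
async def stub_dispatch(server, intent):
    return f"{server}: {intent}"

plan = [("github", "list PRs touching auth.py"),
        ("linear", "find the ticket linked to {prev}")]
out = asyncio.run(run_sequential(stub_dispatch, plan))
```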
Parallel decomposition
Query needs multiple independent pieces. Run in parallel and synthesize. "What are the open PRs and current open issues?" is two independent calls.
Recursive (rare)
The result of one call requires another decomposition. Treat as nested orchestration (Track 2 Module 3 lesson 4). Most production systems cap this at one level.
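One way to enforce that one-level cap is to thread a depth counter through the orchestrator's handle loop. A sketch with stubbed router and dispatch; the `follow_up` result shape and the `needs_decomposition` predicate are assumptions for illustration:

```python
import asyncio

MAX_DEPTH = 1  # most production systems cap nested decomposition at one level

def needs_decomposition(result):
    # Assumed predicate: here, any dict carrying a follow-up query qualifies.
    # A real system would ask the router/model to make this call.
    return isinstance(result, dict) and "follow_up" in result

async def handle(query, router, dispatch, depth=0):
    plan = router(query)
    results = []
    for server, intent in plan:
        result = await dispatch(server, intent)
        if depth < MAX_DEPTH and needs_decomposition(result):
            # recurse exactly once; deeper results are treated as terminal
            result = await handle(result["follow_up"], router, dispatch, depth + 1)
        results.append(result)
    return results

# Stubs for illustration:
def stub_router(q):
    return [("github", q)]

async def stub_dispatch(server, intent):
    if intent == "outer":
        return {"follow_up": "inner"}
    return f"{server}:{intent}"

out = asyncio.run(handle("outer", stub_router, stub_dispatch))
```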
Synthesis across servers
When you have results from multiple servers, synthesizing them is the agent's job. A reasonable prompt:
```
The user asked: {query}
Here are findings from each MCP server:
- filesystem: {fs_result}
- github: {gh_result}
Combine these into one coherent answer.
```

The synthesizer is a regular LLM call, just one that has access to all the gathered context. It's the same role as the supervisor's synthesis in Module 3.
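In code, the synthesizer is just prompt formatting plus one model call. A sketch, again assuming a hypothetical `llm_complete(prompt) -> str` interface rather than any particular SDK:

```python
SYNTHESIS_PROMPT = """The user asked: {query}
Here are findings from each MCP server:
{findings}
Combine these into one coherent answer."""

def synthesize(llm_complete, query, results):
    """Format per-server results and call the model. `llm_complete`
    (prompt -> answer) is an assumed interface, not an MCP API."""
    findings = "\n".join(f"- {server}: {r}" for server, r in results.items())
    return llm_complete(SYNTHESIS_PROMPT.format(query=query, findings=findings))

# Illustration with a stub model that just echoes its prompt:
prompt_seen = synthesize(lambda p: p, "summarize PR #123",
                         {"filesystem": "diff context", "github": "PR data"})
```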
Caching across the orchestrator
If two queries in a session both need the user's profile, you don't want to call the user-profile server twice. A small cache at the orchestrator level helps:
```python
from cachetools import TTLCache

class CachingOrchestrator(MCPOrchestrator):
    def __init__(self, servers, router, cache_ttl=60):
        super().__init__(servers, router)
        self.cache = TTLCache(maxsize=1024, ttl=cache_ttl)

    async def call(self, server, tool, args):
        key = (server, tool, frozenset(args.items()))
        if key in self.cache:
            return self.cache[key]
        result = await self.servers[server].call_tool(tool, args)
        if is_cacheable(server, tool):
            self.cache[key] = result
        return result
```

Only cache calls that are safe to cache: read-only, deterministic, no side effects. A tool with `mutate` or `send` in its name should never be cached.
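The is_cacheable predicate can lean on MCP's tool annotations: the spec lets a tool declare a readOnlyHint. A sketch that assumes the tool definition shape (`.name` plus optional `.annotations.readOnlyHint`) and falls back to a conservative name check when no hint is present:

```python
from types import SimpleNamespace

UNSAFE_MARKERS = ("mutate", "send", "create", "delete", "write", "post")

def is_cacheable(tool_def):
    """True if a tool's results are safe to cache. Assumes the MCP tool
    shape: .name plus optional .annotations.readOnlyHint."""
    ann = getattr(tool_def, "annotations", None)
    hint = getattr(ann, "readOnlyHint", None) if ann else None
    if hint is not None:
        return hint  # an explicit read-only hint wins
    return not any(m in tool_def.name.lower() for m in UNSAFE_MARKERS)

safe = is_cacheable(SimpleNamespace(name="search_files", annotations=None))
unsafe = is_cacheable(SimpleNamespace(name="send_message", annotations=None))
```

Defaulting to "not cacheable" on any ambiguity is the right bias: a stale read is annoying, a replayed side effect is a bug.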
Per-server policy at the orchestrator
The orchestrator is the natural place to enforce cross-cutting policies:
- Cost budgets per server. If one server has been called 50 times this session, fall back to alternatives.
- Latency budgets per call. Hard timeout enforced uniformly so one slow server doesn't block.
- Per-user rate limits. If a user has exhausted their quota, the orchestrator denies further calls before they ever reach a server.
- Result size caps. A server that returns a 100MB blob shouldn't poison the agent's context. Truncate at the orchestrator.
These policies map almost directly to Track 2 Module 5's safety controls. The MCP orchestrator is where you apply them in a multi-server context.
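Latency and size caps in particular are easy to enforce uniformly with a thin wrapper around every outbound call. A sketch; the function names and limits are illustrative, not an MCP SDK API:

```python
import asyncio

MAX_RESULT_BYTES = 100_000  # cap what one server result can add to context

async def call_with_policy(call, timeout_s=10.0):
    """Wrap an outbound server call (any coroutine returning a string) with
    a hard timeout and a result-size cap."""
    try:
        result = await asyncio.wait_for(call, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "[error: server exceeded latency budget]"
    if len(result) > MAX_RESULT_BYTES:
        return result[:MAX_RESULT_BYTES] + "\n[truncated at orchestrator]"
    return result

# Illustration with stub servers:
async def slow_server():
    await asyncio.sleep(5)
    return "never arrives"

async def big_server():
    return "a" * 200_000

timed_out = asyncio.run(call_with_policy(slow_server(), timeout_s=0.01))
truncated = asyncio.run(call_with_policy(big_server()))
```

Returning an error string instead of raising keeps one misbehaving server from failing the whole fan-out; the synthesizer can mention the gap instead.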
When the orchestrator should ask
Sometimes the right behavior is "I don't know which server should handle this; ask the user." Patterns:
Ambiguous intent
The router's confidence is low (Track 2 Module 6 lesson 3). The orchestrator surfaces the alternatives: "Did you mean searching the filesystem or searching GitHub?"
High-stakes action
The router suggests a destructive call. The orchestrator gates it through a human approval (Track 2 Module 5 lesson 3) before dispatching.
Out-of-scope query
No server in the registry plausibly handles the query. The orchestrator says "I don't have a tool for that" instead of guessing.
A good orchestrator has all three of these patterns built in. Without them, the system fails silently in confusing ways.
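All three gates can live in one dispatch-time check. A sketch, assuming the router returns step dicts plus a confidence score; both shapes are assumptions for illustration, not part of MCP:

```python
def decide(plan, confidence, threshold=0.7):
    """Gate a routed plan before dispatch. `plan` is a list of step dicts
    like {"server": ..., "intent": ..., "destructive": bool}; the
    confidence score is assumed to come from the router."""
    if not plan:
        return ("refuse", "I don't have a tool for that.")
    if confidence < threshold:
        options = " or ".join(sorted({s["server"] for s in plan}))
        return ("clarify", f"Did you mean {options}?")
    if any(s.get("destructive") for s in plan):
        return ("approve", "This plan includes a destructive call; confirm?")
    return ("dispatch", plan)
```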
A worked example
A user asks the agent "summarize the changes in PR #123 and the discussion in Slack about it."
Router emits:
```json
[
  {"server": "github", "intent": "fetch PR #123, including diff and comments"},
  {"server": "slack", "intent": "find messages mentioning PR #123 from the last 7 days"}
]
```

The orchestrator dispatches in parallel. Each call hits its respective MCP server. Results come back:
- GitHub: full PR data.
- Slack: 12 messages, with threading.
Synthesizer combines: "PR #123 changes the retry logic to add exponential backoff. The Slack discussion focused on whether to keep the existing 5s default; team decided yes."
That's the production pattern: route, dispatch, synthesize. Same shape as Track 2 Module 3, applied across MCP servers.
Orchestrators and agents are nested
A single agent might have an orchestrator for its MCP servers, and itself might be a worker inside a larger multi-agent system. Each layer has its own routing logic; they compose. The MCP orchestrator's job is to make a single agent's interaction with many servers coherent. The supervisor/worker pattern from Track 2 makes a system of agents coherent. Use both layers when the problem genuinely needs them.
Key takeaway
Once you have multiple MCP servers, an orchestrator layer above the raw clients makes routing, decomposition, and policy enforcement coherent. It maps onto the supervisor/worker pattern from Track 2 with MCP servers as workers. Add caching for read-only calls, per-server budgets and timeouts for safety, and explicit clarification flows for ambiguous queries. The next and final lesson of this track covers observability: how to actually see what's happening inside this multi-server, multi-call setup when things go wrong.