Lesson 12 of 12 · Track 3

MCP in production

Observability for MCP

Logging, tracing, and debugging MCP calls.

Interactive exercise ~10 min

Seeing what's happening

A working MCP setup has at least three moving pieces (host, client, server) and often many more (multiple servers, an orchestrator, multiple agent loops). When something breaks at 2am, you can't reason about it without observability data.

This lesson is about what to log, what to trace, what to monitor, and how to think about the boundaries between agent observability (Track 4 Module 3) and MCP-specific observability. The patterns are familiar from any distributed system; what's specific is the agent-MCP interaction shape.

Three observability layers

Logs

Discrete events with structured fields. Use logs for "what happened" and forensic reconstruction.

Traces

Causally linked spans across components. Use traces for "where did the time go" and "what was the chain of events that produced this outcome."

Metrics

Aggregated counters and histograms. Use metrics for SLOs and dashboards.

You usually want all three. They answer different questions and stack: a metric tells you something is wrong, a trace shows you where, a log shows you the specific event.

What to log per MCP request

Minimum useful fields for each tool call:

  • request_id (correlate the call across host, client, server)
  • session_id (group all calls in one agent run)
  • user_id (when authenticated)
  • server_name, tool_name, args_hash (don't log raw args; they may contain secrets)
  • started_at, ended_at, duration_ms
  • status (ok, error, denied, timeout)
  • error_code, error_message (when applicable)
  • result_size_bytes (for cost / context-pollution monitoring)
In code, the minimal version (hash_args is an app-level helper; a sketch follows in the next section):

log.info("mcp_tool_call", extra={
    "request_id": req_id,
    "session_id": session_id,
    "server_name": server,
    "tool_name": tool,
    "args_hash": hash_args(args),  # never the raw arguments
    "duration_ms": elapsed,
    "status": status,
})

Structured logging (JSON) makes this queryable. Plain text becomes useless past a few thousand lines.
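One way to get there in Python is a JSON formatter on a standard logger. A minimal sketch using the third-party python-json-logger package (an assumption; any JSON formatter that serializes extra fields works):

import logging

from pythonjsonlogger import jsonlogger  # pip install python-json-logger

handler = logging.StreamHandler()
# JsonFormatter emits each record, including the `extra` fields above,
# as a single JSON object per line.
handler.setFormatter(jsonlogger.JsonFormatter())
log = logging.getLogger("mcp")
log.addHandler(handler)
log.setLevel(logging.INFO)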

What NOT to log

Don't log:

  • Raw arguments that may contain user data, secrets, or PII.
  • Auth tokens (filter Authorization headers at the source).
  • Large response bodies (they explode log volume; log size + a hash instead).
  • Inner-loop tool call streams in detail unless explicitly debugging.

A common mistake: turning on verbose logging for debugging and forgetting to turn it off, then leaking auth tokens or user data into your log aggregator.
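One way to make the safe path the default is to keep raw arguments out of log records entirely. The hash_args helper referenced earlier might look like this (a sketch; truncating the digest is a readability choice, not a requirement):

import hashlib
import json

def hash_args(args: dict) -> str:
    # Stable digest of the full argument dict: lets you spot repeated
    # identical calls (same hash) without ever storing the raw values.
    canonical = json.dumps(args, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]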

Tracing across the boundary

A trace span shape that works for MCP:

agent.session
  └─ agent.turn
      ├─ model.call             (LLM round trip)
      ├─ host.dispatch_tool
      │   └─ client.call_tool
      │       └─ server.handle      (in-process or RPC)
      │           └─ server.tool.weather    (the actual capability)
      └─ model.call             (next LLM round trip)

Each span knows its parent. The trace ID flows from the agent through to the server. When the server is in another process or another machine, the trace ID is propagated via headers (HTTP) or environment (stdio).

W3C Trace Context (traceparent header) is the standard. Most MCP SDKs support it for HTTP-based transports; for stdio, you can pass a trace ID via env var when spawning.
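A sketch of the stdio case using OpenTelemetry's propagation API. The TRACEPARENT environment variable name is a convention assumed here, not part of the MCP spec; the server process must read it and continue the trace:

import os
import subprocess

from opentelemetry.propagate import inject

def spawn_stdio_server(cmd):
    # inject() writes the current trace context into the carrier dict,
    # e.g. {"traceparent": "00-<trace-id>-<span-id>-01"}.
    carrier = {}
    inject(carrier)
    env = {**os.environ, "TRACEPARENT": carrier.get("traceparent", "")}
    return subprocess.Popen(
        cmd, env=env,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )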

Metrics that matter

A short list of metrics worth dashboards:

Metric                                          Why
MCP call rate per server                        Who's getting hammered
MCP error rate per (server, tool)               Where the failures are concentrated
MCP latency p50/p95/p99 per (server, tool)      Slow tools that need attention
MCP token cost per session                      LLM cost driver
Server uptime / connection failures             Infrastructure health
Agent loop iterations per request               Quality regression signal
Tool denials per session                        Allow-list / scope effectiveness

The first four are MCP-specific; the last three sit at the agent level but interact closely with MCP.
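As a sketch of instrumenting the first few, using the prometheus_client package (the metric names here are illustrative conventions, not a standard):

from prometheus_client import Counter, Histogram

MCP_CALLS = Counter(
    "mcp_tool_calls_total", "MCP tool calls",
    ["server", "tool", "status"],
)
MCP_LATENCY = Histogram(
    "mcp_tool_call_seconds", "MCP tool call latency",
    ["server", "tool"],
)

# In the host's tool-call path:
MCP_CALLS.labels(server=server, tool=tool, status=status).inc()
MCP_LATENCY.labels(server=server, tool=tool).observe(duration_s)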

Distinguishing agent failures from MCP failures

A common confusion when debugging: the agent gave a bad answer, but you can't tell whether the model reasoned wrong or the MCP server returned bad data.

Solution: log both. The trace should show the exact tool result that was returned to the agent (or a hash plus a sample), so you can verify what the model saw. Without that, every debug session devolves into "I bet the model hallucinated."

import hashlib
import json

# At the host level, log a snippet of the tool result.
log.debug("tool_result", extra={
    "request_id": req_id,
    "tool": namespaced_name,
    "result_summary": summarize(result, max_chars=200),  # app-specific truncation helper
    # sha256 is stable across processes; Python's built-in hash() is not.
    "result_hash": hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest(),
})

You're not logging the full result (privacy, volume), but the hash plus a summary is usually enough to reconstruct what the agent saw and debug from there.

The "agent did one thing N times" pattern

A failure mode worth alerting on: an agent in a loop calling the same MCP tool repeatedly without making progress. This shows up in metrics as a spike in tool-call rate per session, often with the same arguments.

# Detector: the same call made more than `threshold` times in one session.
from collections import defaultdict

# session_id -> {(server, tool, args_hash): count}
SESSION_COUNTERS = defaultdict(lambda: defaultdict(int))

def detect_thrash(session_id, server, tool, args_hash, threshold=5):
    key = (server, tool, args_hash)
    SESSION_COUNTERS[session_id][key] += 1
    if SESSION_COUNTERS[session_id][key] > threshold:
        # alert() is your alerting hook
        alert("agent_thrashing", session_id=session_id, server=server, tool=tool)

This is a Track 4 Module 1 reliability problem (agents stuck in loops), but the signal lives in MCP observability. Wire it up.

Monitoring server health

Each MCP server has its own health story:

  • For stdio servers: Did the process start? Did it exit unexpectedly, and if so, with what signal?
  • For HTTP servers: Is the endpoint reachable? What's the response time? What's the error rate?

Hosts that depend on servers should have a health-check pulse: a periodic call to a known-cheap method (often tools/list) that confirms the connection is alive. If the pulse fails repeatedly, alert and try to reconnect.
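A minimal pulse sketch, assuming an async client object with a list_tools() method (as in the official Python SDK's ClientSession; your stack's names may differ). alert and reconnect stand in for your alerting and reconnect logic:

import asyncio

async def health_pulse(client, server_name, interval_s=30, max_failures=3):
    """Periodically call a cheap method to confirm the connection is alive."""
    failures = 0
    while True:
        try:
            # tools/list is cheap and side-effect free on most servers.
            await asyncio.wait_for(client.list_tools(), timeout=5)
            failures = 0
        except Exception:
            failures += 1
            if failures >= max_failures:
                alert("mcp_server_unhealthy", server=server_name)  # alerting hook
                await reconnect(client)  # reconnect hook
                failures = 0
        await asyncio.sleep(interval_s)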

Tracing across hosts

If your agent stack has multiple processes (a frontend, a backend, multiple worker agents), each one running its own MCP clients, traces have to span them all. The pattern is unchanged from web services: propagate trace IDs across every process boundary, including MCP requests.

When in doubt, treat MCP servers like microservices: they get their own service tier in your observability platform, with their own dashboards and alerts. Treating them as opaque libraries hides too much.

Dashboards I'd build first

For a production agent stack, build these dashboards in this order:

  1. Per-server health and rate. Is each server up, and how often is it called?
  2. Top error tools. Which (server, tool) pairs fail most often?
  3. Latency by (server, tool). What's slow?
  4. Session shape. Average tools per session, average iterations, denial rate.
  5. Cost. LLM cost per request, and MCP cost per request if vendors charge for calls.

Build them before you need them: debugging in production without dashboards is far slower than building them in advance.

Make every MCP call traceable from end to end

The single biggest observability investment that pays off: for every MCP tool call, you should be able to find the agent session that made it, the LLM call that prompted it, the request ID, the server's processing trace, and the result that came back. If any link in that chain is missing, debugging gets harder. Spend the time at design time to keep the chain unbroken.
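A sketch of a host-side wrapper that ties the pieces together: one request ID, one span, one structured log line per tool call. It uses the OpenTelemetry tracing API; the client.call_tool signature is an assumption about your SDK:

import logging
import time
import uuid

from opentelemetry import trace

log = logging.getLogger("mcp")
tracer = trace.get_tracer("mcp-host")

async def traced_call_tool(client, session_id, server, tool, args):
    request_id = str(uuid.uuid4())
    with tracer.start_as_current_span("client.call_tool") as span:
        span.set_attribute("mcp.server", server)
        span.set_attribute("mcp.tool", tool)
        span.set_attribute("mcp.request_id", request_id)
        start = time.monotonic()
        status = "ok"
        try:
            return await client.call_tool(tool, args)  # SDK method name may vary
        except Exception as exc:
            status = "error"
            span.record_exception(exc)
            raise
        finally:
            log.info("mcp_tool_call", extra={
                "request_id": request_id,
                "session_id": session_id,
                "server_name": server,
                "tool_name": tool,
                "duration_ms": int((time.monotonic() - start) * 1000),
                "status": status,
            })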

Wrapping up Track 3

This module closes Track 3. From fundamentals through production, you've covered:

  • Module 1: What MCP is, host/client/server architecture, transports.
  • Module 2: Building MCP servers (tools, resources, prompts, lifecycle).
  • Module 3: Building MCP clients (architecture, multi-server, dynamic tools).
  • Module 4: Production MCP (auth, orchestration, observability).

You can now expose your own capabilities as MCP servers, connect to multiple servers from a single agent, secure the connections, orchestrate across them, and operate the whole thing visibly. Track 4 takes the broader view: reliability, evaluation, observability, deployment, and scaling for agent systems as a whole. Many of the MCP-specific patterns from this track generalize there.

Key takeaway

Observability for MCP follows the standard distributed-systems pattern: structured logs with no secrets, traces that span agent, host, client, and server, metrics for rates and latency, and dashboards built before you need them. Two MCP-specific things to invest in: the ability to trace any tool call from agent to server, and detection of agent-thrash patterns in tool-call frequency. With that, Track 3 is complete: protocol, servers, clients, production. Onward to Track 4.
