Observability
Tracing multi-agent requests
Following a request through multiple agents.
Following one request through the system
Logs answer "what happened?" Traces answer "where did the time go and what called what?" For agent systems, where one user request can fan out into dozens of tool calls across multiple servers, traces are how you actually understand a single execution.
This lesson covers how to instrument agents for tracing, how to model spans, what to capture, and how to use traces to debug production issues you couldn't otherwise see.
What a trace is
A trace is a tree of spans. Each span represents a unit of work (a function call, an API request, a tool invocation). Each span has:
- A start and end timestamp.
- A name.
- A parent span (or none, for the root).
- Attributes (key-value tags).
- Events (timestamped messages within the span).
The trace is the whole tree. The trace ID is shared by every span; the span ID is unique to each span; the parent ID links them.
A typical agent trace:
```
agent.session [800 ms]
├── agent.turn [320 ms]
│   ├── model.call [180 ms] {tokens_in: 1200, tokens_out: 80}
│   ├── tool.call (filesystem.read_file) [50 ms] {result_size: 2400}
│   └── tool.call (sql.query) [80 ms] {rows: 12}
├── agent.turn [410 ms]
│   ├── model.call [220 ms] {tokens_in: 1800, tokens_out: 130}
│   └── tool.call (filesystem.search) [180 ms] {hits: 4}
└── agent.session.end [10 ms]
```
You can see at a glance: how many turns, how long each took, what each turn called, and where the time went.
Picking a tracing library
OpenTelemetry is the standard. It's library-agnostic, has SDKs for every major language, and most observability vendors (Datadog, Honeycomb, Tempo, etc.) accept its data format. Pick the OTel SDK for your language.
```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-app")

with tracer.start_as_current_span("agent.session") as session_span:
    session_span.set_attribute("user_id", user_id)
    with tracer.start_as_current_span("agent.turn") as turn_span:
        with tracer.start_as_current_span("model.call") as model_span:
            response = call_model(...)
            model_span.set_attribute("tokens_in", response.input_tokens)
            model_span.set_attribute("tokens_out", response.output_tokens)
```
The shape is the same in every OTel SDK: a span context manager that auto-closes when the block exits. The library handles parent-child relationships and propagation.
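One thing the snippet glosses over: `get_tracer` only returns a useful tracer once a tracer provider has been configured; without one, spans are no-ops. A minimal sketch of that one-time setup, assuming the `opentelemetry-sdk` and OTLP exporter packages are installed and a collector is listening on the default endpoint:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One-time setup at process start: batch finished spans and export them
# over OTLP (defaults to a collector on localhost:4317).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
```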
What to span
Span boundaries should map to interesting units of work. A reasonable set for agents:
- `agent.session` (root span for one user request).
- `agent.turn` (one iteration of the orchestration loop).
- `model.call` (one LLM API request).
- `tool.call` (one tool invocation, including MCP calls).
- `state.checkpoint` (one persistence write).
- `validator.run` (one validation pass).
- `human.approval` (waiting for human approval).
Don't span every function. The right granularity is "things you'd want to time and look at separately." Spans below ~5ms add visual noise without insight.
Attributes worth capturing
Each span type has its own useful attributes:
| Span | Useful attributes |
|---|---|
| `agent.session` | `user_id`, `agent_type`, `request_source`, `total_cost_estimate` |
| `agent.turn` | `turn_number`, `state_at_start` (e.g. `"PLANNING"`) |
| `model.call` | `model_name`, `tokens_in`, `tokens_out`, `temperature`, `used_tools` |
| `tool.call` | `tool_name`, `server_name`, `args_hash`, `status`, `result_size` |
| `state.checkpoint` | `bytes_written`, `store_type` |
| `validator.run` | `validator_name`, `passed` (bool), `issues_count` |
The attributes are what make traces queryable. "Show me sessions where total_cost_estimate > $1" is impossible without recording total_cost_estimate.
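As a sketch of what this looks like at one boundary, here is a hypothetical `tool.call` wrapper that records the attributes from the table. It assumes the `tracer` from the earlier snippet; `run_tool` is a stand-in for your dispatch function, and the args hash anticipates the privacy rules later in this lesson:

```python
import hashlib
import json

def traced_tool_call(tool_name: str, server_name: str, args: dict):
    with tracer.start_as_current_span("tool.call") as span:
        span.set_attribute("tool_name", tool_name)
        span.set_attribute("server_name", server_name)
        # Record a hash, never the raw arguments (see "Privacy in traces").
        digest = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()
        span.set_attribute("args_hash", digest[:16])
        try:
            result = run_tool(tool_name, args)  # hypothetical dispatch function
            span.set_attribute("status", "ok")
            span.set_attribute("result_size", len(str(result)))
            return result
        except Exception:
            span.set_attribute("status", "error")
            raise
```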
Trace propagation
When work crosses a process or service boundary, the trace context has to travel with it. OTel uses the W3C Trace Context standard: a traceparent header on HTTP requests, with equivalents for RPC protocols.
For HTTP-based MCP servers:
```python
import httpx

# Client side: OTel adds the traceparent header automatically when configured
async with httpx.AsyncClient() as client:
    await client.post(server_url, json=mcp_request)  # traceparent injected

# Server side: OTel reads the header and links spans to the parent
@app.post("/mcp")
async def mcp_handler(request):
    # incoming span automatically becomes child of the agent's span
    ...
```
For stdio MCP servers (no headers), pass the trace context via an environment variable when spawning the server process. The server reads it and starts its own root span as a child.
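A sketch of that handoff, using OTel's W3C propagator to do the encoding and decoding. The `TRACEPARENT` env var name and the `my-mcp-server` binary are our own conventions for illustration:

```python
import os
import subprocess

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()

# Agent side: serialize the current trace context into a dict, then hand
# it to the child process through an environment variable.
carrier: dict = {}
propagator.inject(carrier)
env = {**os.environ, "TRACEPARENT": carrier.get("traceparent", "")}
server = subprocess.Popen(["my-mcp-server"], env=env)  # hypothetical server binary

# Server side: rebuild the context and parent the first span under it.
ctx = propagator.extract({"traceparent": os.environ.get("TRACEPARENT", "")})
tracer = trace.get_tracer("mcp-server")
with tracer.start_as_current_span("mcp.session", context=ctx):
    ...
```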
Sampling
Tracing is more expensive than logging because each span has more data. Sampling controls cost.
Three patterns:
Head-based (decide at trace start)
Decide whether to sample the trace at the root span. If sampled, all spans in the trace are recorded; if not, none are. Cheap to implement, but it can miss interesting traces, since you can't know at the start which ones will go wrong.
Tail-based (decide at trace end)
Buffer spans until the trace ends, then decide. If the trace had errors or took too long, sample. Otherwise drop. More expensive (you have to buffer) but catches the interesting cases.
Always-sample errors
Even with head-based sampling at 10%, always record traces that ended in errors. The math works out: errors are rare, so keeping all of them adds little volume while guaranteeing you can debug every failure.
A reasonable production policy: 100% of error traces, 100% of slow traces (above a threshold), 5% sampling of everything else. You catch 100% of problems without keeping 100% of the noise.
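Tail-based sampling usually lives in a collector, but the decision logic fits in a short sketch. A toy in-process version, assuming every span of a trace ends in the same process and the root span ends last:

```python
import random

from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.trace import StatusCode

SLOW_NS = 5_000_000_000   # assumed "slow" threshold: 5 seconds
BASELINE_RATE = 0.05      # keep 5% of unremarkable traces

class TailSamplingProcessor(SpanProcessor):
    def __init__(self, exporter):
        self.exporter = exporter
        self.buffer = {}  # trace_id -> finished spans

    def on_end(self, span):
        self.buffer.setdefault(span.context.trace_id, []).append(span)
        if span.parent is not None:
            return  # not the root; keep buffering
        spans = self.buffer.pop(span.context.trace_id)
        errored = any(s.status.status_code is StatusCode.ERROR for s in spans)
        slow = (span.end_time - span.start_time) > SLOW_NS
        # Keep 100% of error and slow traces, a slice of everything else.
        if errored or slow or random.random() < BASELINE_RATE:
            self.exporter.export(spans)
```

You would register this with `provider.add_span_processor(...)` in place of the plain batch processor shown earlier; a production setup would also need buffer limits and timeouts.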
Reading traces during incidents
When debugging a production incident, the workflow is:
- Find the trace. From a user-reported request_id, query the trace database.
- Look at the shape. Is it the right number of turns? Did any span fail?
- Look at the latencies. Where did the time go? A span much longer than expected is a clue.
- Look at the attributes. Which model? Which tools? What were the token counts?
- Cross-reference with logs. The trace tells you what happened; logs (filtered by request_id) tell you the details.
Traces and logs are complementary. Traces give the structure; logs give the details.
Common trace patterns
Some patterns to recognize:
The fanout
One agent.turn produces many parallel tool.call spans. Useful: this is how parallel dispatch should look. Bad: if you expected one tool call and got ten, you have a runaway loop.
The cascade
A long chain of nested spans, each a fraction of its parent. Useful: shows the call chain clearly. Bad: very deep cascades suggest excessive nesting in the agent's planning.
The flat
Many sequential spans at the same level. Useful: simple linear work. Bad: missing parent-child relationships might indicate broken propagation.
The plateau
A long span with little inside it. Usually means the span captured wait time (waiting on a slow tool, a model call, a network round-trip). The thing you're waiting on may not be instrumented.
Trace-driven debugging
Some bugs only show up at the trace level, not at the log level:
- N+1 patterns. The agent calls a tool once per item in a list when it should batch. Visible as a fanout of small spans inside a turn.
- Sequential when parallel would work. Two independent tool calls running serially. Visible as two adjacent spans where parallelism would have stacked them.
- Overlong setup. A span where most of the time is in setup before any meaningful work. Often a session-init bug.
These are hard to see from logs alone. Traces show timing and concurrency in a way text logs can't.
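A minimal sketch of the second pattern (sequential-when-parallel) and its fix, assuming `call_tool` is an async wrapper that opens a `tool.call` span, like the earlier example:

```python
import asyncio

async def turn_sequential(call_tool):
    # Trace shows two tool.call spans end-to-end; latency is their sum.
    doc = await call_tool("filesystem.read_file", {"path": "report.md"})
    rows = await call_tool("sql.query", {"query": "SELECT 1"})
    return doc, rows

async def turn_parallel(call_tool):
    # Trace shows the same two spans overlapping; latency is the slower call.
    return await asyncio.gather(
        call_tool("filesystem.read_file", {"path": "report.md"}),
        call_tool("sql.query", {"query": "SELECT 1"}),
    )
```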
Privacy in traces
Same rules as logs: don't put sensitive data in span attributes. Especially:
- No raw user input in attributes.
- No raw tool arguments (use a hash or a redacted version).
- No tokens, secrets, or auth headers.
Traces are usually retained shorter than logs (7-30 days is typical), but they get inspected by more people. The bar for sensitive data should be at least as high.
Cost shape
Tracing costs money. Creating and ending an in-process span is cheap (microseconds to a fraction of a millisecond), but exporting spans to a remote backend has network and storage costs. Budget:
- A typical agent session has 20-50 spans.
- At 100% sampling, that's 20-50 spans per request.
- At 1000 requests per minute, that's 20k-50k spans per minute.
- Storage and backend costs scale linearly.
Sampling rates are how you control this. The key insight is that you don't need every trace; you need every interesting trace. Sampling done well captures the interesting ones while dropping the boring ones.
Wire tracing in early
Adding tracing to a code base after the fact is painful. Adding it from day one costs nearly nothing. If your agent codebase doesn't have tracing yet, this is one of the highest-leverage things to add. Future-you, debugging a 2am incident, will be grateful.
Key takeaway
Traces show one request's structure: nested spans, timings, parent-child relationships, attributes. OpenTelemetry is the standard SDK. Span boundaries are agent.session, agent.turn, model.call, tool.call, plus a few others. Propagate trace context across MCP and HTTP boundaries. Sample by sending all errors and a slice of normal traffic. Traces complement logs (which give details) and metrics (which give trends). The next lesson covers metrics, dashboards, and alerting on top of all this telemetry.