Observability
Structured logging for agent systems
What to log and how to structure it.
Logs you can query
Most agents start with print() statements scattered through the code. That works in dev. In production, when an incident is happening at 2am and you're trying to figure out what one user's session was doing, plain text logs are a wall of pain.
The fix is structured logging: every log line is a JSON object with consistent fields. You can grep, filter, aggregate, and chart logs as if they were database rows. With a few specific fields wired through your agent stack, debugging shifts from "read 10,000 lines of text" to "filter by request_id and look at the timeline."
This lesson covers what fields to log, how to propagate context, what to redact, and where logs fit alongside traces and metrics.
The minimum useful log entry
For an agent system, every log entry should have at least:
```json
{
  "ts": "2026-05-07T15:24:31.123Z",
  "level": "info",
  "event": "tool_call_completed",  // what kind of thing happened
  "request_id": "req-a3f1",        // ID for this user request
  "session_id": "sess-42",         // ID for the agent session
  "user_id": "u-7",                // who's affected
  "service": "agent-gateway",      // which component logged it
  "version": "v1.2.3"              // build of that component
}
```

These are the ID fields. Everything else (durations, statuses, sizes) is event-specific.
Event types: a few dozen is enough
A common worry: "do I need to enumerate every possible event?" No. A few dozen distinct event types is enough for most agent systems. Examples:
- agent_session_started
- agent_turn_started / agent_turn_completed
- model_call_started / model_call_completed
- tool_call_started / tool_call_completed
- tool_call_denied
- output_validation_failed
- human_approval_requested / human_approval_decided
- agent_session_completed / agent_session_failed
Each event name follows a noun_verb pattern. The set is small and predictable; queries become straightforward ("show me all tool_call_denied events for user X today").
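For illustration, here's a minimal sketch of that kind of query against JSON-lines logs using only the standard library. The file name and the session_id/tool fields printed at the end are assumptions for the example, not part of the schema above.

```python
import json
from pathlib import Path

def find_events(log_path, event, **filters):
    """Yield log entries matching an event type and exact field values."""
    for line in Path(log_path).read_text().splitlines():
        entry = json.loads(line)
        if entry.get("event") == event and all(entry.get(k) == v for k, v in filters.items()):
            yield entry

# "Show me all tool_call_denied events for user u-7."
for entry in find_events("agent-gateway.log", "tool_call_denied", user_id="u-7"):
    print(entry["ts"], entry["session_id"])
```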
Context propagation
The IDs (request_id, session_id, etc.) need to flow through every layer. The agent code generates them at the top level; tools, MCP clients, and downstream services include them in their own log lines.
```python
import json
import logging
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

logger = logging.getLogger("agent")

def now_iso():
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")

@dataclass
class LoggingContext:
    """Holds the IDs that propagate through a request."""
    request_id: str
    session_id: str
    user_id: str

    def log(self, event, **fields):
        # One JSON object per line; the context IDs ride along on every event.
        logger.info(json.dumps({
            "ts": now_iso(),
            "event": event,
            **asdict(self),
            **fields,
        }))
```

Hand the context to every tool function (or stash it in a thread-local or contextvars variable). When a tool logs, it logs with the context. When you grep for a request_id later, you get every event from every component that touched that request.
For multi-process systems (HTTP between services), pass IDs in headers. For MCP, propagate via tool call metadata or environment.
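For the HTTP case, here's a minimal sketch of the sending side, assuming the requests library; the header names (X-Request-Id, X-Session-Id) are conventions you pick, not a standard. The receiving service reads the same headers and builds its own LoggingContext from them.

```python
import requests

def call_downstream(ctx: LoggingContext, url: str, payload: dict) -> dict:
    # The downstream service logs with these same IDs, so both sides of the
    # hop show up when you later filter by request_id.
    resp = requests.post(
        url,
        json=payload,
        headers={"X-Request-Id": ctx.request_id, "X-Session-Id": ctx.session_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```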
What to log per layer
At the agent loop level
- Session lifecycle (started, completed, failed).
- Turn boundaries (model call started/completed, model call cost in tokens).
- Tool calls (log an arguments hash, never the raw arguments unless you've sanitized them).
- State transitions (PLANNING -> EXECUTING -> ...).
At the tool / MCP layer
- Tool name, server name, args hash.
- Latency, status (ok, error, timeout, denied).
- Result size in bytes (a hash if you want to correlate identical results).
At the model layer
- Model name, version.
- Input/output token counts.
- Latency for the call itself.
- Whether the response had tool calls or content.
At the validator layer
- Each validation pass and result.
- For failures: what specifically failed.
Cumulatively this is maybe 5-15 events per turn. For a 10-turn session, that's 50-150 log lines. Normal.
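Putting the tool/MCP-layer fields together, here is a sketch of a wrapper that emits the started/completed pair. The args_hash helper and the run_tool signature are assumptions for illustration; only the event and field names follow the conventions above.

```python
import hashlib
import json
import time

def args_hash(args: dict) -> str:
    # Stable hash so identical calls correlate without logging raw arguments.
    return hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()[:12]

def run_tool(ctx: LoggingContext, tool_name: str, tool_fn, args: dict):
    ctx.log("tool_call_started", tool_name=tool_name, args_hash=args_hash(args))
    start = time.monotonic()
    status = "error"
    try:
        result = tool_fn(**args)
        status = "ok"
        return result
    except TimeoutError:
        status = "timeout"
        raise
    finally:
        ctx.log(
            "tool_call_completed",
            tool_name=tool_name,
            args_hash=args_hash(args),
            status=status,
            duration_ms=int((time.monotonic() - start) * 1000),
        )
```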
What NOT to log
The rule: anything that, if leaked, would harm a user or your business.
- Auth tokens, passwords, API keys.
- Raw PII (emails, names, addresses, phone numbers) unless your system specifically permits it.
- Full message contents from end-users (a hash is usually enough; use sampling for the rare cases when you need contents).
- Full tool outputs (especially from servers like databases that may return user records).
A common mistake: turning on debug logging that dumps raw request bodies, then forgetting to turn it off. Filter sensitive fields at the logger level, not at the call site, so a careless log.debug(payload) doesn't accidentally leak secrets.
```python
SENSITIVE_FIELDS = {"password", "token", "api_key", "auth", "ssn", "credit_card"}

def redact(value):
    # Recurse into nested dicts and lists so sensitive keys can't hide deeper down.
    if isinstance(value, dict):
        return {k: ("***" if k.lower() in SENSITIVE_FIELDS else redact(v)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value
```

Apply this at the logger, not at every call site. Belt and suspenders: check at multiple layers.
Log levels
Use levels intentionally:
- DEBUG: tracing, helpful in dev, mostly off in prod (or sampled at 1%).
- INFO: normal operation events; this is the bulk of your logs.
- WARN: something unexpected but recoverable; investigate later.
- ERROR: something failed; investigate now.
- FATAL: can't continue; page someone.
Don't put noisy chatter at WARN. WARN is for "investigate later"; if you have to investigate every WARN, the level is wrong.
Sampling
Some events fire so often that logging every one is wasteful. Common patterns:
- Always log errors and warns. Skipping these to save volume is the wrong place to economize.
- Sample successful tool calls. Log all errors, sample 10% of successes. Captures the shape without the full volume.
- Always log session boundaries. Even if individual turns are sampled, having the start/end of every session anchors the rest of the timeline.
Make sampling decisions explicit in code, not implicit in log volume.
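Here is a sketch of what "explicit in code" can look like; the 10% rate mirrors the example above, and the should_log helper is an assumption, not a fixed API.

```python
import random

TOOL_SUCCESS_SAMPLE_RATE = 0.10  # keep 10% of successful tool calls

def should_log(event: str, status: str = "ok") -> bool:
    if status != "ok":
        return True  # always log errors, timeouts, denials
    if event.startswith("agent_session_"):
        return True  # always keep session boundaries
    if event == "tool_call_completed":
        return random.random() < TOOL_SUCCESS_SAMPLE_RATE
    return True
```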
Log shape conventions
A few small things that pay off:
Use snake_case keys
tool_call_completed, session_id, duration_ms. Consistency matters because every query is going to use the keys verbatim.
Standardize duration units
Pick ms and stick with it (duration_ms, latency_ms, wait_ms). Mixing seconds and milliseconds across logs is the kind of small inconsistency that wastes a lot of time.
Include cardinality-friendly tags
A model field with values like claude-sonnet-4-5 lets you group easily. Avoid free-text fields where you could have a controlled vocabulary; they explode dashboard cardinality.
Logs vs traces vs metrics
The three observability layers serve different questions:
| Question | Best tool |
|---|---|
| What exactly happened in this request? | Logs |
| Where did the time go in this request? | Traces |
| How is the system behaving overall? | Metrics |
You usually want all three. Don't skip metrics because logs feel sufficient; metrics are how you notice trends and SLO violations. Don't skip traces because metrics feel sufficient; traces are how you debug a specific weird request.
Logs sit in the middle: too detailed for at-a-glance health, too noisy to chart, but the only thing that lets you reconstruct exactly what one specific request did.
Long-term retention
Production logs add up. For agent systems, consider:
- Hot tier (7-14 days): queryable, low-latency, expensive. Most debugging happens here.
- Warm tier (30-90 days): queryable but slower, cheaper. Used for retrospectives.
- Cold tier (1+ year): archival, slow to query, cheap. Compliance, long-term trends.
For each event type, decide the lifecycle. Audit logs for safety-critical actions might live in cold tier for years; routine tool calls might be deleted from the hot tier after a week.
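One way to record those per-event decisions so they live in code rather than in someone's head; the tier names follow the list above, while the specific durations are placeholders, not recommendations.

```python
# Days each event type is kept per tier; None means it never reaches that tier.
RETENTION_DAYS = {
    "human_approval_decided": {"hot": 14, "warm": 90, "cold": 365 * 5},  # safety-critical audit trail
    "agent_session_failed":   {"hot": 14, "warm": 90, "cold": 365},
    "tool_call_completed":    {"hot": 7,  "warm": None, "cold": None},   # routine volume
}
```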
Make request_id visible to the user
A small but high-leverage practice: surface the request_id in user-facing error messages. "Something went wrong (id: req-a3f1)" lets the user paste that ID into a support request. Now your team can find the entire log timeline for that session in one query. Don't expose internal details, just the ID.
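A minimal sketch of that message, reusing the LoggingContext from earlier; the wording is just an example.

```python
def user_facing_error(ctx: LoggingContext) -> str:
    # Expose only the ID, never stack traces or internal details.
    return f"Something went wrong (id: {ctx.request_id}). Include this id if you contact support."
```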
Key takeaway
Structured logging makes debugging tractable: every log line is JSON, every entry has standard ID fields (request_id, session_id, user_id), and event types come from a small noun_verb vocabulary. Propagate context through every layer; redact sensitive fields at the logger; sample successes but always log errors. Logs answer "what happened in this request"; traces answer "where did the time go"; metrics answer "how is the system behaving." The next lesson goes deep on traces.