
Observability

Dashboards and alerting

What to monitor, what to page on.


What to chart and what to page on

Logs are for debugging. Traces are for understanding one request. Metrics and dashboards are for steady-state observability: the at-a-glance "is this thing healthy" view. Alerts are the subset of those metrics that wake people up.

This lesson covers what metrics matter for agent systems, what dashboards to build, and what's worth paging on. The patterns aren't agent-specific (every web service has metrics and alerts), but which metrics matter is.

Metrics worth tracking

Sort them into four groups:

Health metrics

"Is the system working?"

  • Request rate (per minute). Are requests coming in?
  • Error rate (% of requests). Are most of them failing?
  • p50 / p95 / p99 latency. Is it responsive?
  • Active sessions. How many concurrent agent loops are running?

These are the basics. Every web service has them; agent systems are no different.
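As a concrete starting point, here's a minimal sketch of those four metrics using the Python prometheus_client library. The metric names, label values, and bucket boundaries are illustrative, not a standard.

```python
import time
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter("agent_requests_total", "Requests handled", ["status"])
LATENCY = Histogram(
    "agent_request_seconds", "End-to-end request latency in seconds",
    buckets=(0.5, 1, 2, 5, 10, 30, 60),
)
ACTIVE_SESSIONS = Gauge("agent_active_sessions", "Concurrent agent loops")

def handle_request(run_agent, request):
    """Wrap one agent request with the four health metrics."""
    ACTIVE_SESSIONS.inc()
    start = time.monotonic()
    try:
        result = run_agent(request)
        REQUESTS.labels(status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)
        ACTIVE_SESSIONS.dec()
```

Request rate and error rate fall out of the counter at query time (a rate over a window, split by the status label); the p50/p95/p99 latencies come from the histogram.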

Quality metrics

"Are the outputs good?"

  • Eval pass rate (from the eval pipeline; sample real production traffic if you can).
  • Self-reported confidence distribution. How sure is the agent of its own answers?
  • Validation failure rate. How often is output validation rejecting?
  • Approval rejection rate (when human approvals are involved).

Quality metrics drift slowly; you'll watch trends, not real-time spikes.
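Because they move slowly, gauges updated from a daily job are enough. A small sketch, where run_eval_suite() and validation_stats are stand-ins for whatever your eval pipeline and validators actually expose:

```python
from prometheus_client import Gauge

EVAL_PASS_RATE = Gauge("agent_eval_pass_rate", "Fraction of eval cases passing")
VALIDATION_FAILURE_RATE = Gauge(
    "agent_validation_failure_rate", "Fraction of outputs rejected by validation"
)

def publish_quality_metrics(run_eval_suite, validation_stats):
    """Run once a day; inputs are stand-ins for your own eval pipeline."""
    results = run_eval_suite()  # ideally sampled from real production traffic
    EVAL_PASS_RATE.set(results["passed"] / max(results["total"], 1))
    VALIDATION_FAILURE_RATE.set(
        validation_stats["rejected"] / max(validation_stats["checked"], 1)
    )
```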

Cost metrics

"How expensive is this?"

  • LLM cost per request. Tokens-in + tokens-out priced per model.
  • MCP server calls per request. Coupled to vendor pricing.
  • Cost per session by task type. Which workloads are draining money.
  • Cost per user (or per tenant). Useful for billing and abuse detection.

These matter from day one. Agent systems can develop alarming cost profiles fast.
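A minimal sketch of the cost-per-request calculation. The price table is a placeholder, not real vendor pricing.

```python
# Placeholder prices: (input $/1K tokens, output $/1K tokens) per model.
# Substitute your vendor's actual price sheet.
PRICE_PER_1K = {
    "big-model": (0.010, 0.030),
    "small-model": (0.0005, 0.0015),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one LLM request, priced per model."""
    price_in, price_out = PRICE_PER_1K[model]
    return (tokens_in / 1000) * price_in + (tokens_out / 1000) * price_out

# Cost per session, per task type, or per user is this summed over the
# requests tagged with that session_id / task_type / user_id.
```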

Behavioral metrics

"What is the agent actually doing?"

  • Tool calls per session, as a distribution.
  • Iterations per session. How many turns does the agent take?
  • Strategy switches per session (Track 2 Module 6).
  • Stuck-loop incidents (Track 3 Module 4 thrash detector).
  • Fallback rate. How often is the agent falling back to Plan B?

Behavioral metrics tell you whether the shape of agent work is changing, not just whether it's working.
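A sketch of recording that shape at the end of each session, again with illustrative metric names. The session fields used here (tool_calls, iterations, used_fallback) are assumptions; adapt them to your own session record.

```python
from prometheus_client import Counter, Histogram

TOOLS_PER_SESSION = Histogram(
    "agent_tools_per_session", "Tool calls per session",
    buckets=(0, 1, 2, 5, 10, 20, 50),
)
ITERATIONS_PER_SESSION = Histogram(
    "agent_iterations_per_session", "Agent loop turns per session",
    buckets=(1, 2, 5, 10, 20, 30, 50),
)
TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls by tool name", ["tool"])
FALLBACKS = Counter("agent_fallbacks_total", "Sessions that used the fallback path")

def record_session(session):
    """Called once when a session ends; field names are assumed, not prescribed."""
    TOOLS_PER_SESSION.observe(len(session.tool_calls))
    ITERATIONS_PER_SESSION.observe(session.iterations)
    for call in session.tool_calls:
        TOOL_CALLS.labels(tool=call.tool_name).inc()
    if session.used_fallback:
        FALLBACKS.inc()
```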

Dashboards I'd build

In rough priority order:

1. Agent health overview

One screen showing: request rate, error rate, p95 latency, active sessions. This is the front door.

2. Per-tool / per-MCP-server health

Same metrics, broken down by server and tool. Lets you see "GitHub MCP is timing out" without digging.

3. Cost breakdown

Cost over time, by request type, by user, by model. Includes a list of today's ten most expensive sessions.

4. Quality trends

Pass rates, validation failure rates, approval rejection rates over time. Updated daily from the eval pipeline.

5. Behavioral

Tools-per-session distribution. Iterations-per-session distribution. Stuck-loop detections. The "is the shape of work normal" dashboard.

The first two get most of the attention. The last three get attention when something feels off.

Alerts

Two kinds:

Page-the-engineer alerts

Wake someone up. Reserve for:

  • Error rate above a threshold for several minutes.
  • p99 latency above a threshold sustained.
  • The agent is unavailable (cannot create new sessions).
  • Critical dependency is down (the model API is unreachable).
  • Stuck-loop detection rate spiking (suggests a deploy regression).

These should be rare and real. Every page that didn't need to wake someone trains people to ignore alerts.

Dashboard / Slack alerts

Notify a channel. Use for:

  • Eval scores dropping after a deploy.
  • Cost spike for a single user.
  • A new error pattern showing up in logs.
  • Approval rejection rate climbing.

These don't need an immediate response but someone should look soon.

Common alert mistakes

Alerting on absolute counts

"Page if error count exceeds 10/min" sounds reasonable until traffic doubles. Use rates (% of requests), not raw counts.

Alerting on instantaneous values

A single 5-second spike doesn't matter. Use averages over a window (5 to 15 minutes) or "for X minutes in the last hour" formulations.
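A sketch of both fixes at once: an error rate (not a raw count) evaluated over a sustained window. The window size and threshold are illustrative.

```python
from collections import deque

WINDOW_MINUTES = 10          # illustrative
ERROR_RATE_THRESHOLD = 0.05  # illustrative: 5% of requests

minutes = deque(maxlen=WINDOW_MINUTES)  # each entry: (requests, errors) for one minute

def record_minute(requests: int, errors: int) -> None:
    minutes.append((requests, errors))

def should_page() -> bool:
    """Page only when the error *rate* stays high across the whole window."""
    if len(minutes) < WINDOW_MINUTES:
        return False  # not enough history to judge
    total = sum(r for r, _ in minutes)
    failed = sum(e for _, e in minutes)
    return total > 0 and failed / total > ERROR_RATE_THRESHOLD
```

In a Prometheus-style setup, the same rule would be an alerting expression over a rate() of the request and error counters with a for: duration, rather than an explicit buffer like this.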

Alerting on every error

Some errors are transient and self-heal. Alert on persistent error rates, not on individual errors. Logs and traces are for individual events.

Alerting on metrics that don't have clear actions

If you wake someone up, they should know what to do. "Tool X latency is high" without context is a useless page. Either include the runbook in the alert or don't alert on it.

SLOs

Service Level Objectives are the formal contract: "99% of requests complete in under 5 seconds, measured weekly." They're how you decide when to invest in reliability versus features.

For agent systems, useful SLOs are:

  • Availability: % of requests that produce a final answer (success or graceful failure, not a crash). 99-99.9% is realistic.
  • Latency: p95 latency for normal requests. Set based on what users tolerate.
  • Quality: eval pass rate above a floor. Slower-moving than the others; usually weekly or per release.

When your SLO is in danger, you stop shipping features and fix reliability. When you have plenty of error budget, you can move fast. This isn't agent-specific; it's the standard SRE pattern, applied to agents.
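Error budget is the operational handle here. A sketch of the arithmetic, with an illustrative 99% availability target:

```python
def error_budget_remaining(slo_target: float, total: int, good: int) -> float:
    """Fraction of the error budget left for the window; negative means the SLO is blown."""
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 0.0
    return 1 - actual_bad / allowed_bad

# Illustrative week: 99% availability target, 100_000 requests, 400 of them
# failed outright. The budget was 1_000 bad requests, so 60% of it remains.
remaining = error_budget_remaining(0.99, 100_000, 99_600)
```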

Things specific to agents that catch you out

Some metrics and alerts that aren't always on default dashboards:

Token-cost spike alerts

A single bad prompt change can quadruple token usage. Alert on cost-per-request crossing a threshold. Catches regressions before they bankrupt the project.
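One way to implement this is to compare recent cost per request against a trailing baseline rather than a fixed number. The 2x multiplier and the window choices below are illustrative.

```python
from statistics import mean

def cost_spiked(recent_costs: list[float], baseline_costs: list[float],
                multiplier: float = 2.0) -> bool:
    """True if the recent average cost per request exceeds multiplier x the baseline."""
    if not recent_costs or not baseline_costs:
        return False
    return mean(recent_costs) > multiplier * mean(baseline_costs)

# e.g. recent_costs = per-request costs from the last hour,
#      baseline_costs = per-request costs from the previous 24 hours.
```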

Iteration-count alerts

A regression that makes the agent take 30 turns instead of 5 won't always trip latency alerts (each turn is fast) but will trip cost and quality alerts. Surface "average iterations per session" prominently.

Tool-selection-mix alerts

If your agent suddenly stops using tool X (or starts heavily using tool Y), something probably changed. Alert on big shifts in tool-mix, not just on tool errors.
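A sketch of one way to quantify "big shift": total variation distance between the baseline and current tool-call distributions, with an illustrative threshold.

```python
from collections import Counter

def tool_mix_shifted(baseline: Counter, current: Counter,
                     threshold: float = 0.2) -> bool:
    """True when the tool-call mix has moved far from the baseline distribution."""
    tools = set(baseline) | set(current)
    base_total = sum(baseline.values()) or 1
    cur_total = sum(current.values()) or 1
    # Total variation distance between the two tool-usage distributions (0..1).
    distance = 0.5 * sum(
        abs(baseline[t] / base_total - current[t] / cur_total) for t in tools
    )
    return distance > threshold

# baseline = Counter of tool calls last week, current = Counter of tool calls today.
```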

Stuck-loop spikes

Track 3 Module 4 introduced the thrash detector. Wire its alerts into your paging system. A spike in stuck-loop detections is usually a regression in the agent's prompt or planner.

Connecting the layers

Logs, traces, metrics, and alerts work together:

  • Metrics flag that something's wrong.
  • Dashboards narrow down which component.
  • Traces show what one bad request was doing.
  • Logs show the details of what happened in that trace.

When all four are wired up to the same IDs (request_id, session_id), you can hop between them in seconds. When they aren't, debugging becomes much slower.
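One lightweight way to keep those IDs shared: bind request_id and session_id in a context variable at the start of each request and stamp them onto every log line (and span attribute) the request produces. A sketch, with illustrative field names:

```python
import contextvars
import json
import logging
import uuid

request_ctx = contextvars.ContextVar("request_ctx", default={})

def start_request(session_id: str) -> dict:
    """Mint a request_id and bind both IDs for everything this request emits."""
    ctx = {"request_id": str(uuid.uuid4()), "session_id": session_id}
    request_ctx.set(ctx)
    return ctx

def log_event(event: str, **fields) -> None:
    # Every structured log line carries the same IDs the trace spans carry.
    logging.info(json.dumps({"event": event, **request_ctx.get(), **fields}))
```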

Wrapping up observability

This module covered:

  • Lesson 1: structured logging, what to log, what to redact.
  • Lesson 2: tracing, span shapes, propagation, sampling.
  • Lesson 3 (this one): metrics, dashboards, alerts.

The pattern is layered: logs are the lowest level (one event per line); traces are the structured view of one request; metrics are the aggregated steady-state view; alerts are the subset of metrics that warrant action. Wire all of them with shared IDs; budget for sampling; treat the observability stack as a real product.

The next module is the last: deployment and scaling. Containerization, queues, cost control, human-in-the-loop. Many of those decisions interact directly with the observability you've just built.

Build dashboards before incidents, not during them

The single biggest observability mistake is putting off dashboards until you actually need them. During an incident is the worst time to build a dashboard: you're stressed, you don't know what you're looking for, the data isn't shaped for the question you have. Build dashboards in advance for the metrics you know will matter. The cost is a few hours; the benefit is being able to think during an incident instead of fight your tooling.

Key takeaway

Metrics and dashboards turn observability data into steady-state visibility: health, quality, cost, behavior. Alerts page only on real, actionable problems with clear runbooks. Agent-specific metrics worth highlighting include token cost per request, iterations per session, and tool-selection mix. SLOs let you decide when to ship features versus fix reliability. With logs + traces + metrics + alerts wired to shared IDs, debugging shrinks from hours to minutes.
