Reliability
Why agents fail: taxonomy of failure modes
Understanding the ways agents break.
Why agents fail
Building reliable agents starts with knowing the ways they break. The failures aren't random; they cluster into a small number of patterns, each with its own diagnosis and its own fix. This lesson is the taxonomy. The next three lessons of this module address the most common categories.
The taxonomy splits agent failures into five buckets:
- Tool failures. Something inside a tool went wrong.
- Reasoning failures. The model produced an output that's wrong even though everything around it worked.
- Loop failures. The orchestration loop itself misbehaves: stuck, looping, exiting too early.
- State failures. The agent's memory or shared state got corrupted, lost, or out of sync.
- External failures. A network, a service, or a dependency outside the agent broke.
Most production incidents fall cleanly into one of these buckets. Confusing one category for another is the most common way to misdiagnose a problem and apply the wrong fix.
Tool failures
The agent called a tool. The tool ran. Something didn't work.
Sub-categories:
- Argument mismatch. The model passed arguments the tool doesn't accept: wrong types, missing required fields, unexpected extras.
- Downstream failure. The tool called an external API that returned a 500. The DB query timed out. The file didn't exist.
- Resource exhaustion. Rate limit hit. Quota exceeded. Disk full.
- Logic bug in the tool. The tool itself has a defect.
How they show up: structured tool errors (good), exceptions bubbling up (bad), or silently wrong results (worst).
How to diagnose: per-tool error rates and a sample of recent failures with full args. Track 3 Module 4 covers the observability side.
How to fix:
- For argument mismatch: better tool schemas and descriptions, and validate arguments at the executor (a sketch follows this list).
- For downstream failure: retries with backoff (next lesson).
- For resource exhaustion: budgets and graceful degradation.
- For logic bugs: regular old debugging.
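For argument mismatches in particular, a small validation layer in the executor pays for itself. Below is a minimal sketch, assuming a hypothetical tool registry and a `search_orders` tool; the schema format and names are illustrative, not any specific framework's API.

```python
# Minimal executor-side argument validation (hypothetical tool registry).
# Rejecting bad args with a structured error lets the model retry with a
# corrected call instead of crashing the loop.

TOOL_SCHEMAS = {
    "search_orders": {
        "required": {"customer_id": str},
        "optional": {"limit": int},
    },
}

def validate_args(tool_name: str, args: dict) -> str | None:
    """Return an error message describing the mismatch, or None if args are OK."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return f"unknown tool: {tool_name}"
    allowed = {**schema["required"], **schema["optional"]}
    for key in schema["required"]:
        if key not in args:
            return f"missing required argument: {key}"
    for key, value in args.items():
        if key not in allowed:
            return f"unexpected argument: {key}"
        if not isinstance(value, allowed[key]):
            return f"wrong type for {key}: expected {allowed[key].__name__}"
    return None

error = validate_args("search_orders", {"customer_id": 42})
# -> "wrong type for customer_id: expected str"; return this to the model
#    as a structured tool error so it can correct the call.
```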
Reasoning failures
The tools worked. The state was clean. The model produced something wrong anyway.
Sub-categories:
- Hallucinated facts. Made up data, invented citations, claimed to have done something it didn't.
- Wrong tool selection. Picked a near-miss tool instead of the right one.
- Plan errors. The plan skipped a necessary step or got the order wrong.
- Format errors. Output doesn't match the expected schema.
- Off-policy responses. Said something it wasn't supposed to (rude, off-brand, harmful).
How they show up: the agent's answer is plausible but wrong, or the answer is correct in shape but wrong in content.
How to diagnose: evals (Module 2 of this track). You can't catch reasoning failures in production logs alone; you need a labeled set with right answers.
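A minimal sketch of what that looks like, assuming you have a `run_agent` callable and a handful of hand-labeled cases; the cases and the grading rule here are illustrative only.

```python
# Minimal eval harness for catching reasoning failures (all names hypothetical).
# Each case pairs an input with a known-correct answer; the agent is graded
# offline, which production logs alone can't do.

LABELED_CASES = [
    {"input": "What is the refund window for EU orders?", "expected": "14 days"},
    {"input": "Which warehouse ships SKU-1042?", "expected": "Leipzig"},
]

def run_eval(run_agent) -> float:
    passed = 0
    for case in LABELED_CASES:
        answer = run_agent(case["input"])
        # Exact-match grading is the simplest check; real evals often use
        # normalized matching or an LLM grader instead.
        if case["expected"].lower() in answer.lower():
            passed += 1
        else:
            print(f"FAIL: {case['input']!r} -> {answer!r}")
    return passed / len(LABELED_CASES)
```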
How to fix:
- For hallucinations: ground in tool outputs, ask for citations, add a verifier (Module 1 lesson 3).
- For wrong tool selection: tighter scopes, better descriptions, fewer tools at once.
- For plan errors: explicit reflection step (Track 2 Module 6 lesson 1).
- For format errors: schema validation with retry (Module 1 lesson 3; sketch after this list).
- For off-policy: output guardrails (Track 2 Module 5 lesson 4).
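Here is a minimal sketch of schema validation with retry, assuming a hypothetical `call_model` function that returns raw text and a task that expects a JSON object with two required fields.

```python
# Minimal schema-validation-with-retry loop (call_model is hypothetical).
# If the output doesn't parse or is missing fields, the error is fed back to
# the model for a corrective attempt before giving up.

import json

REQUIRED_FIELDS = {"summary", "priority"}

def get_structured_output(call_model, prompt: str, max_attempts: int = 3) -> dict:
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            feedback = f"\n\nYour last reply was not valid JSON ({e}). Reply with JSON only."
            continue
        if not isinstance(data, dict):
            feedback = "\n\nYour last reply must be a single JSON object."
            continue
        missing = REQUIRED_FIELDS - data.keys()
        if missing:
            feedback = f"\n\nYour last reply was missing fields: {sorted(missing)}."
            continue
        return data
    raise ValueError("model did not produce valid output after retries")
```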
Loop failures
The model and tools were fine. The orchestration loop didn't run them right.
Sub-categories:
- Stuck loop. Agent calls the same tool with the same args over and over.
- Premature exit. Agent stops before the task is done.
- Runaway loop. Agent does N times more work than expected; cost spikes.
- Wrong handoff target. Supervisor dispatches to the wrong worker.
How they show up: latency spikes, cost spikes, or short answers that don't address the full request.
How to diagnose: traces. The trace shape tells you exactly which kind of loop failure happened. Repeated identical calls = stuck. Trace ends after one turn = premature exit. Trace has 30 turns = runaway.
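If your traces are available as structured data, that triage can be automated. A minimal sketch, assuming a trace is a list of `(tool_name, args)` tuples and using illustrative thresholds:

```python
# Post-hoc classification of a trace's shape. The trace format and thresholds
# here are assumptions for the sketch, not a standard.

def classify_trace(calls: list[tuple[str, dict]], max_expected_turns: int = 15) -> str:
    if len(calls) >= 3 and len({(name, str(args)) for name, args in calls[-3:]}) == 1:
        return "stuck loop: last three calls are identical"
    if len(calls) <= 1:
        # Not always a failure, but worth flagging for premature-exit review.
        return "possible premature exit: only one turn"
    if len(calls) > max_expected_turns:
        return "runaway loop: more turns than any expected task needs"
    return "trace shape looks normal"
```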
How to fix:
- For stuck loops: thrash detection + force-exit (Track 3 Module 4 covered the detector).
- For premature exit: better termination prompt or explicit exit conditions in the state machine (Track 2 Module 4 lesson 2).
- For runaway loops: max-iteration budget and per-tool budgets (a sketch of these guards follows this list).
- For wrong handoff: better intent in the planner prompt; tighter worker descriptions.
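A minimal sketch of those runtime guards, assuming a hypothetical `run_turn` callable that executes one model turn plus its tool call and reports whether the agent is done:

```python
# Runtime guards for the orchestration loop (sketch; run_turn and its result
# fields are assumptions). Budgets stop runaways; the repeat check force-exits
# stuck loops instead of letting them burn the whole budget.

from collections import Counter

def run_agent_loop(run_turn, max_iterations: int = 20, max_calls_per_tool: int = 5):
    tool_counts = Counter()
    last_call = None
    for _ in range(max_iterations):
        result = run_turn()                      # one model turn + tool execution
        if result.done:
            return result.answer
        call_key = (result.tool_name, str(result.tool_args))
        if call_key == last_call:
            return "exited: repeated identical tool call (stuck loop)"
        last_call = call_key
        tool_counts[result.tool_name] += 1
        if tool_counts[result.tool_name] > max_calls_per_tool:
            return f"exited: per-tool budget exceeded for {result.tool_name}"
    return "exited: max-iteration budget exhausted (runaway loop)"
```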
State failures
The loop ran. The tools worked. The state in between got mangled.
Sub-categories:
- Lost state. Something was supposed to be remembered; wasn't.
- Stale state. Cached data that should have been invalidated wasn't.
- Concurrent writes. Two agents write the same field and last-write-wins produces the wrong result.
- Schema drift. Persisted state has an old shape; loader doesn't handle it.
- Checkpoint corruption. Save file is half-written, lost, or unreadable.
How they show up: weird inconsistencies. The agent acts as if it doesn't know something it should, or knows something stale.
How to diagnose: dump the state at the moment of confusion and compare to expectation. Hard without good observability into the state object.
How to fix:
- For lost state: shared blackboard, single-writer slices (Track 2 Module 4 lesson 1).
- For stale state: explicit invalidation, TTLs.
- For concurrent writes: per-slice locks or single-writer per slice.
- For schema drift: versioned snapshots, migration logic on load (Track 2 Module 4 lesson 3).
- For checkpoint corruption: atomic writes (Track 2 Module 4 lesson 3).
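A minimal sketch of atomic, versioned checkpoint writes, using the standard-library pattern of writing to a temp file and renaming into place; the version number and payload shape are illustrative.

```python
# Atomic, versioned checkpoint write (sketch). Writing to a temp file and
# renaming means a crash mid-write leaves the previous checkpoint intact;
# the version field lets the loader detect schema drift on load.

import json
import os
import tempfile

STATE_VERSION = 2

def save_checkpoint(state: dict, path: str) -> None:
    payload = {"version": STATE_VERSION, "state": state}
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)   # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise

def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        payload = json.load(f)
    if payload.get("version") != STATE_VERSION:
        raise ValueError(f"checkpoint schema drift: found version {payload.get('version')}")
    return payload["state"]
```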
External failures
The agent and its dependencies were fine. The world outside broke.
Sub-categories:
- Network outage. DNS down, packet loss, TLS errors.
- Dependency service down. The vendor's MCP server is offline.
- Auth lapse. Token expired, refresh failed.
- Quota exhausted. Hit a vendor's API limit.
How they show up: most calls fail with the same error. If only one tool's calls fail, it's probably a tool failure (above); if everything fails, it's external.
How to diagnose: the error is uniform across calls and matches a known external incident.
How to fix:
- For network: retries with backoff, circuit breakers (sketch after this list).
- For dependency down: fallback servers, degraded mode (Module 1 lesson 2).
- For auth lapse: proactive refresh, and an alert before tokens approach expiry.
- For quota: budgets in your own system that fire before the vendor's do.
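A minimal sketch combining retries with backoff and a simple circuit breaker; `failure_threshold`, `cooldown_s`, and the wrapped function are illustrative assumptions, and production breakers usually track a half-open state as well.

```python
# Retry with exponential backoff plus a minimal circuit breaker (sketch).
# Retries absorb brief network blips; the breaker stops hammering a
# dependency that is clearly down until a cooldown has passed.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.opened_at = 0.0

    def call(self, fn, *args, max_retries: int = 3, **kwargs):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: dependency marked as down")
            self.failures = 0                     # cooldown over, try again
        delay = 1.0
        for attempt in range(max_retries):
            try:
                result = fn(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                if attempt == max_retries - 1:
                    raise
                time.sleep(delay)
                delay *= 2                        # exponential backoff
```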
Why bother with the taxonomy
Each category has different signals, different fixes, and a different mean time to recovery (MTTR). Misclassifying a state failure as a reasoning failure leads you to twiddle prompts when the real issue is concurrent writes. Misclassifying a tool failure as an external failure leads you to add retries when the real issue is a permanent argument mismatch.
When an incident happens, the first question is "which category is this?" The right diagnostic and the right fix follow from that.
A cheat sheet
| Symptom | Likely category | First diagnostic |
|---|---|---|
| Same wrong answer every time | Reasoning | Run evals on this case |
| Different wrong answer every time | Reasoning (high-temp) or state | Set temp=0 and re-run |
| Tool errors in logs | Tool | Inspect args + tool internals |
| Latency spike, no errors | Loop (runaway) or external (slow) | Check trace shape |
| Cost spike, no errors | Loop (runaway) | Iteration counts per session |
| Some users unaffected | Tool or external | Compare working vs failing users' traces |
| Suddenly broken at exact time | External | Check vendor status |
| Worked yesterday, not today | State (schema drift) or external | Version history and vendor status |
This isn't exhaustive but it covers the bulk of incidents. Memorize the rough mapping; it speeds up real diagnosis.
What this module will do
The remaining lessons in this module address the most actionable categories:
- Lesson 2 (retries and fallbacks): handles tool failures and external failures.
- Lesson 3 (output validation): handles reasoning failures (format and content).
- Lesson 4 (runtime policy): handles loop failures and dynamic constraint changes.
State failures point back to Track 2 Module 4 (state management) for the foundations; this module adds the production-tested practices on top.
Reliability is mostly about narrowing the unknown
A reliability engineer's job is to make failures more legible: faster to detect, faster to classify, faster to fix. The taxonomy is the starting point; observability (Module 3) is the eyes; evals (Module 2) are the lab. None of it makes failures impossible; all of it makes them less destructive.
Key takeaway
Agent failures cluster into five categories: tool, reasoning, loop, state, external. Each has distinct signals and distinct fixes. Diagnosing the category before applying a fix is the difference between actually solving an incident and spending hours adjusting things that aren't the problem. The next three lessons go deep on the most actionable buckets: retries/fallbacks, output validation, and runtime policy.