Reliability
Retry strategies and fallbacks
Graceful degradation for agent systems.
Trying again, smarter
The previous lesson named the failure categories. The most common in-production failures are transient: a tool fails once and would succeed if you tried again. This lesson covers the patterns that turn transient failures into invisible blips: retries, backoff, jitter, circuit breakers, and graceful fallbacks.
These are not agent-specific patterns. They're the standard distributed-systems toolkit. What's specific is how to apply them inside an agent loop without amplifying the problem you're trying to fix.
When to retry
Retry only when the failure could plausibly resolve on its own with no change to the inputs. A 500 from a server, a TCP reset, a timeout: yes. A 4xx from a server, a bad-arguments error, a permission-denied: no.
# Status codes that usually indicate a transient server-side failure.
# Timeouts and connection resets are caught by the isinstance check below.
TRANSIENT_ERRORS = {500, 502, 503, 504}

def is_retryable(error):
    if hasattr(error, "status") and error.status in TRANSIENT_ERRORS:
        return True
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    return False

If the error is permanent (the file doesn't exist; the user lacks permission; the input is malformed), retrying is just wasted work. Worse, if the agent retries indefinitely on a permanent error, you have a runaway loop.
Exponential backoff with jitter
The basic algorithm: wait, double, wait again, double, and so on, capped at some maximum.
def backoff(attempt, base=0.1, cap=30):
    return min(cap, base * (2 ** attempt))

Without jitter, all clients retry at the same time and pile back into the system that just broke. Jitter spreads the retries.
import random

def backoff_jittered(attempt, base=0.1, cap=30):
    return random.uniform(0, min(cap, base * (2 ** attempt)))

(Full jitter: pick a random value from [0, exp_backoff]. Other variants exist; full jitter is the simplest reasonable default.)
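One of those variants, decorrelated jitter, bases each delay on the previous delay rather than on the attempt number. A minimal sketch; the threefold multiplier follows the commonly cited formulation and is a tunable, not a law:

import random

def backoff_decorrelated(prev_delay, base=0.1, cap=30):
    # Each new delay is drawn from [base, 3 * previous delay], capped.
    return min(cap, random.uniform(base, prev_delay * 3))

The caller seeds prev_delay with base and feeds each returned delay into the next call.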
Retry budgets
A retry without a budget is a runaway loop in the making. Always set:
- Max attempts per call (3 to 5 is typical).
- Max total time per logical operation (regardless of attempts).
- Max retries per session. A session that's already retried 50 times is probably stuck (a per-session budget is sketched after the code below).
import asyncio
import time

class MaxRetriesExceeded(Exception):
    """The call failed after exhausting its retry budget."""

async def call_with_retry(fn, *, max_attempts=3, max_total_seconds=30):
    deadline = time.time() + max_total_seconds
    last_err = None
    for attempt in range(max_attempts):
        if time.time() >= deadline:
            break  # total-time budget exhausted
        try:
            return await fn()
        except Exception as e:
            if not is_retryable(e):
                raise  # permanent errors propagate immediately
            last_err = e
            await asyncio.sleep(backoff_jittered(attempt))
    raise MaxRetriesExceeded() from last_err

The agent loop above this should know that a tool can fail-after-retries and have a plan for it (next section).
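call_with_retry covers the first two budgets. The per-session cap needs state that outlives any single call; a minimal sketch, assuming the agent loop owns a SessionBudget (a name invented here) and charges it from wherever the retry loop sleeps:

class SessionBudget:
    """Caps total retries across one agent session."""

    def __init__(self, max_session_retries=50):
        self.remaining = max_session_retries

    def charge_retry(self):
        # Call once per retry, e.g. right before the backoff sleep in
        # call_with_retry. Stops a session that is clearly stuck.
        self.remaining -= 1
        if self.remaining <= 0:
            raise MaxRetriesExceeded("session retry budget exhausted")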
Circuit breakers
Retries help with transient failures. They make persistent failures worse: if a service is down, retrying every call adds load and slows the agent. Circuit breakers cap this.
A circuit breaker has three states:
- Closed. Normal operation. Calls go through.
- Open. The service is failing too often. Calls are rejected immediately for a cooldown period.
- Half-open. The cooldown elapsed. One probe call is allowed; if it succeeds, close; if it fails, re-open.
import time

class Breaker:
    def __init__(self, failure_threshold=5, recovery_seconds=60):
        self.failures = 0
        self.opened_at = None
        self.threshold = failure_threshold
        self.recovery = recovery_seconds

    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.time() - self.opened_at < self.recovery:
            return "open"
        return "half_open"

    def record_success(self):
        # A normal call or a half-open probe succeeded: reset to closed.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

Wrap the call site:
class CircuitOpenError(Exception):
    """Raised instead of calling a service whose breaker is open."""

async def call_via_breaker(breaker, fn):
    state = breaker.state()
    if state == "open":
        raise CircuitOpenError("breaker open")
    try:
        result = await fn()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise

When the breaker is open, the agent gets a fast failure instead of an expensive timeout. The agent's recovery logic kicks in (fallback, error message, etc.) instead of compounding load on a struggling service.
Fallbacks
The agent should have a plan for "the tool I wanted to use is unavailable." Three patterns:
Cheaper backup tool
If sql_warehouse_query is down, fall back to cached_summary_query. The data is staler but the agent has something.
Different MCP server providing the same capability
If the primary GitHub MCP server is down, switch to the secondary one. Same tool surface, different host.
Degraded answer
The agent says "I couldn't reach the database; here's what I can answer without it." Better than a hung loop.
Fallbacks are decisions, not retries. They live above the retry layer:
async def query_data(args):
    try:
        return await call_with_retry(lambda: primary_sql.query(**args))
    except (CircuitOpenError, MaxRetriesExceeded):
        try:
            return await call_with_retry(lambda: cached_summary.query(**args))
        except Exception:
            return {"status": "degraded", "data": None, "reason": "data services unavailable"}

The agent receives a structured response either way and can reason about it.
Don't retry side-effecting calls without idempotency
A tool that sends an email, posts a Slack message, or charges a credit card should not be auto-retried unless it has an idempotency key. Otherwise a "retry" produces a duplicate side effect.
If your tool is send_email, the call site should look something like:
def send_email(to, subject, body, idempotency_key):
    # has_seen / previous_result_for / record stand in for whatever
    # storage layer deduplicates requests in your system.
    if has_seen(idempotency_key):
        return previous_result_for(idempotency_key)
    result = actually_send(to, subject, body)
    record(idempotency_key, result)
    return result

Generate the idempotency key per logical request (not per attempt). Now retries are safe; the email goes once.
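"Per logical request" in code terms: mint the key once, outside the retry loop, so every attempt reuses it. A sketch tying this to call_with_retry; uuid4 is just one way to mint a key, and asyncio.to_thread bridges the synchronous send_email into the async retry helper:

import asyncio
import uuid

async def send_email_with_retry(to, subject, body):
    # One key per logical send; every retry attempt reuses it,
    # so at most one email actually goes out.
    key = str(uuid.uuid4())
    return await call_with_retry(
        lambda: asyncio.to_thread(send_email, to, subject, body, key)
    )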
Where retries live in the loop
There are two natural places to put retries in an agent loop:
Inside the tool (closer to the failure)
Fast feedback, no model involvement. Good for transient failures the model doesn't need to know about.
At the agent layer (visible to the model)
The model sees the failure and decides what to do (retry, switch strategy, ask the user). Slower but more flexible.
The right answer is usually both: the tool retries quietly for fast transient failures; if it still fails, surface to the agent as a structured error and let the agent decide.
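A minimal sketch of that combination, assuming a run_tool wrapper and a result shape invented here: the tool layer retries quietly via call_with_retry, and only an exhausted or breaker-rejected call surfaces to the model, as a structured error rather than a raw exception.

async def run_tool(name, fn):
    # Layer 1: quiet retries close to the failure. Transient errors
    # that resolve within the budget never reach the model.
    try:
        result = await call_with_retry(fn)
        return {"tool": name, "status": "ok", "result": result}
    except (CircuitOpenError, MaxRetriesExceeded) as e:
        # Layer 2: a structured failure the model can reason about
        # (switch strategy, use a fallback, ask the user).
        return {"tool": name, "status": "unavailable", "reason": str(e)}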
Failure modes of retries themselves
Retry storms
If many clients retry the same failing service simultaneously, you create a thundering herd that ensures the service stays down. Jitter mitigates the problem; a circuit breaker fixes it.
Retries hiding root causes
If you retry until success, you might hide real problems. Always log retries (with reason) so post-incident analysis can find services that succeed only after 4 retries on average.
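One way to do that with the standard logging module; the helper is meant to be called from call_with_retry's except branch, just before the sleep, and the field names are arbitrary:

import logging

log = logging.getLogger("agent.retries")

def log_retry(tool_name, attempt, error):
    # One structured line per retry, so post-incident analysis can
    # find tools that only succeed after several attempts.
    log.warning("retry tool=%s attempt=%d reason=%r", tool_name, attempt + 1, error)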
Retries on permanent errors
A common bug: marking an error as "retryable" when it's actually permanent. The system retries forever. Validate your is_retryable carefully and skew toward not retrying when in doubt.
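A cheap guard against that bug is a handful of assertions over is_retryable, skewed toward checking that permanent errors are classified as permanent. A sketch, assuming a toy error type with a status attribute:

class ToolError(Exception):
    def __init__(self, status):
        self.status = status

# Transient: retryable.
assert is_retryable(ToolError(503))
assert is_retryable(TimeoutError())

# Permanent: never retry these.
assert not is_retryable(ToolError(404))
assert not is_retryable(ToolError(403))
assert not is_retryable(ValueError("malformed input"))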
The agent does not know about your retries
A subtle point: when a retry succeeds, the model should see the successful result, not the retry history. When a retry exhausts, the model should see "this failed permanently" and decide what to do, not a stack of intermediate errors. Hide the retry mechanics from the model unless they affect the meaning of the result.
Key takeaway
Transient failures should be retried with exponential backoff and jitter, capped by per-call and per-session budgets. Persistent failures should trip a circuit breaker. The agent above this layer should have explicit fallbacks: cheaper tools, alternate servers, or graceful degradation. Side-effecting tools require idempotency keys before retries are safe. The next lesson tackles the other big reliability bucket: validating outputs that came back successful but might be wrong.