Checkpointing and resumability
Making agent workflows survive restarts.
Surviving restarts
Long-running agent workflows have a problem nobody talks about until the first time it bites: processes die. The orchestrator crashes mid-task. The container gets reaped. The user closes the browser. Your agent had completed nine of ten steps, and then it ate dirt and lost everything.
Checkpointing fixes this. After every interesting transition, you persist enough state to resume. When the system restarts, it reloads the last checkpoint and picks up where it left off. The state machine from the previous lesson is what makes this clean: the state cursor is the resumption point.
What to checkpoint
A reasonable checkpoint contains:
- The current state (PLANNING, EXECUTING, etc.).
- The plan (steps remaining, current cursor).
- Committed results so far (whatever the orchestrator has accepted).
- The original user request and any session-level context.
- A monotonically increasing version number so you can detect partial writes.
Things you do not checkpoint:
- The inner loop's working memory mid-tool-call. If a worker crashes mid-call, restart the worker from the start of its current handoff.
- Tool-call traces beyond what the supervisor needs.
- Model parameters, API keys, anything that should come from configuration.
The checkpoint should be small enough to write atomically and complete enough to resume cleanly. A few KB per checkpoint is normal.
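For concreteness, a serialized checkpoint might look like this; the field values are illustrative, but the fields mirror the list above:

{
  "version": 1,
  "state": "REFLECTING",
  "request": "Summarize last week's incident reports",
  "plan": ["fetch reports", "extract themes", "draft summary"],
  "cursor": 1,
  "results": [{"step": 0, "output": "12 reports fetched"}]
}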
When to checkpoint
The natural moments are state-machine transitions. Specifically:
- After PLANNING completes (you have a fresh plan).
- After OBSERVING records a worker's result (you have a new finding).
- After REFLECTING decides the next move (you know what's coming).
Do not checkpoint during EXECUTING (a worker's middle-of-loop state is not yours to save) or during the inner-loop tool calls.
def drive_with_checkpoint(loop, store):
    while loop.state not in (State.COMPLETE, State.FAILED):
        step_once(loop)               # one transition's worth of work
        store.save(loop.snapshot())   # checkpoint after every transition
    store.save(loop.snapshot())       # final checkpoint

The checkpoint after every transition is overkill for some systems and exactly right for others. The threshold is roughly: how much work do you tolerate losing in a crash? For a 30-second agent run, checkpointing every step is unnecessary. For a 30-minute run, it is mandatory.
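If per-transition saves are too frequent for your workload, a time-based throttle is a simple middle ground. A minimal sketch, reusing the loop and store from above; min_interval_s is an invented knob:

import time

def drive_throttled(loop, store, min_interval_s=5.0):
    last_save = 0.0
    while loop.state not in (State.COMPLETE, State.FAILED):
        step_once(loop)
        if time.monotonic() - last_save >= min_interval_s:
            store.save(loop.snapshot())   # at most one save per interval
            last_save = time.monotonic()
    store.save(loop.snapshot())           # always persist the terminal state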
Atomic writes
A half-written checkpoint is worse than no checkpoint. The standard fix is write-then-rename:
import json
import os

def save_atomic(snapshot, path):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(json.dumps(snapshot))
        f.flush()
        os.fsync(f.fileno())   # force the bytes to disk before the rename
    os.replace(tmp, path)      # atomic on POSIX

This guarantees that any reader either sees the old checkpoint or the new one, never a half-written one. For databases, use a transaction with a single update statement. For object stores, use the storage API's atomic upload semantics.
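For the database case, here is a sketch using the standard-library sqlite3 module; the table name and schema are assumptions, and the upsert syntax needs SQLite 3.24+:

import json
import sqlite3

# Assumed schema: CREATE TABLE checkpoints (run_id TEXT PRIMARY KEY, body TEXT)
def save_checkpoint_db(conn: sqlite3.Connection, run_id: str, snapshot: dict):
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT INTO checkpoints (run_id, body) VALUES (?, ?) "
            "ON CONFLICT (run_id) DO UPDATE SET body = excluded.body",
            (run_id, json.dumps(snapshot)),
        )

A reader sees the old row or the new row, never a partial write.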
Resuming
On startup:
- Load the checkpoint.
- Validate the schema version. If incompatible, error out (do not silently misread old state).
- Reconstruct the loop's Loop object.
- Drive it from the loaded state.
def resume(store):
    snapshot = store.load()
    if snapshot is None:
        return None   # nothing to resume; start fresh
    if snapshot["version"] != CHECKPOINT_VERSION:
        raise IncompatibleCheckpoint(snapshot["version"])
    loop = Loop.from_dict(snapshot)
    drive_with_checkpoint(loop, store)
    return loop

The state cursor in the loaded snapshot tells the driver exactly which handler to run next. This is why the state machine matters: without it, "where do we resume from?" has no clean answer.
Idempotency for the inner loop
Resuming from a checkpoint means a worker that was mid-execution will be re-run. If the worker has side effects (sending email, calling external APIs, writing to a database), re-running it can cause duplicates: two emails, two API calls, two writes.
Two strategies:
Idempotency keys
Each side-effecting tool call carries a deterministic key. The receiving system checks if it has seen the key and no-ops on duplicates. The same key in two different worker runs produces the same outcome (one effect, not two).
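A sketch of the key side, assuming the call is addressed by run, step, tool, and arguments; claim() and deliver() are hypothetical stand-ins for an atomic set-if-absent store and the actual send:

import hashlib
import json

def idempotency_key(run_id: str, step: int, tool: str, args: dict) -> str:
    # Deterministic: the same logical call yields the same key across
    # worker restarts, so a re-run collides with the first attempt.
    payload = json.dumps(
        {"run": run_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def send_email_once(store, key: str, message: dict):
    # Receiving side: first claim wins, duplicates no-op.
    if not store.claim(key):   # hypothetical atomic set-if-absent (e.g. Redis SETNX)
        return "duplicate"
    return deliver(message)    # hypothetical actual send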
Plan-level idempotency
Build the plan such that every step is naturally repeatable. "Read file X" is fine; you can do it twice. "Send the user an email" is not; you cannot. For non-idempotent actions, push them to the very end of the plan, after all decisions are committed, so a crash before them just means re-deciding (cheap), not re-acting (expensive).
For most agent workflows, a small set of side-effecting tools deserves the idempotency-key treatment: any tool that posts, sends, deploys, charges, or modifies external state. The rest can be re-run safely.
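One way to encode both ideas in code; the tool names in SIDE_EFFECTING are illustrative:

SIDE_EFFECTING = {"send_email", "deploy", "charge_card"}  # illustrative tool names

def order_plan(steps):
    # Stable partition: repeatable steps first, side-effecting steps last,
    # preserving relative order within each group.
    safe = [s for s in steps if s["tool"] not in SIDE_EFFECTING]
    risky = [s for s in steps if s["tool"] in SIDE_EFFECTING]
    return safe + risky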
Storage choices
| Storage | Pros | Cons | Right for |
|---|---|---|---|
| Local file | Fast, simple | Lost if container dies | Dev, single-host |
| Redis | Fast, supports TTL | Memory cost; need persistence config | Most production |
| Postgres | Durable, queryable, transactional | Heavier API | Long-lived workflows, audit needs |
| Object store (S3) | Cheap, durable | Higher latency | Long-running batch agents |
Redis is the typical default for online agent workloads. Postgres if you want to query checkpoints (e.g., admin dashboard showing all in-progress agents). Object store for batch jobs where checkpoints are only read on restart.
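A minimal Redis-backed store, sketched with redis-py; the key scheme and default TTL are assumptions, not a prescribed layout:

import json
import redis

class RedisCheckpointStore:
    def __init__(self, client: redis.Redis, run_id: str, ttl_s: int = 24 * 3600):
        self.client = client
        self.key = f"checkpoint:{run_id}"   # key scheme is an assumption
        self.ttl_s = ttl_s

    def save(self, snapshot: dict):
        # A single SET is atomic in Redis; EX refreshes the TTL on every write.
        self.client.set(self.key, json.dumps(snapshot), ex=self.ttl_s)

    def load(self):
        raw = self.client.get(self.key)
        return json.loads(raw) if raw is not None else None

    def delete(self):
        self.client.delete(self.key)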
TTLs and cleanup
Checkpoints accumulate. Without a TTL or cleanup, your store fills up with finished and abandoned runs. Two policies:
- Delete on COMPLETE / FAILED. As soon as a loop terminates, drop its checkpoint.
- TTL after last write. Any checkpoint not touched for N hours gets auto-deleted.
The second is more forgiving (a temporarily-stalled run can resume), but the first is cleaner if you have it. A combination is fine: delete on terminal state plus a 24-hour TTL as a safety net.
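Wired into the driver, the combined policy is a few lines (using the store sketched above):

def drive_with_cleanup(loop, store):
    while loop.state not in (State.COMPLETE, State.FAILED):
        step_once(loop)
        store.save(loop.snapshot())   # each write refreshes the TTL safety net
    store.delete()                    # terminal state: drop the checkpoint now
    # If the process dies before delete() runs, the TTL reaps the orphan.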
Checkpointing is not transactional with the world
A checkpoint records what your orchestrator believes happened, not what actually happened. If your worker sent an email and crashed before the supervisor got the result, the checkpoint will not show the email-send. On resume you may send a second email. Idempotency keys are the only real fix; checkpointing alone cannot give you exactly-once semantics for external side effects.
A minimal checkpoint protocol
from dataclasses import asdict, dataclass

@dataclass
class Snapshot:
    version: int
    request: str
    state: str
    plan: list
    cursor: int
    results: list

def checkpoint(loop, store):
    snap = Snapshot(
        version=1,
        request=loop.request,
        state=loop.state.name,
        plan=loop.plan,
        cursor=loop.cursor,
        results=loop.results,
    )
    store.save_atomic(asdict(snap))   # serialize to a plain dict before writing

That is enough to resume any of the loops we have written so far. The next module covers safety and control: how to make sure a resumed loop does not, for instance, charge a customer twice because the agent decided "well, last time we saw the cart it was unpaid."
Key takeaway
Checkpointing turns a fragile long-running loop into a resumable one. Save the state cursor, the plan, and committed results after every transition. Use atomic writes. Make tools idempotent for any side-effecting action. The state machine from the previous lesson is what makes resumption clean: with it, "where do we resume from" is well-defined; without it, you are guessing. The next module is about controlling what those resumed (or fresh) loops are allowed to do.