Checkpointing and resumability
Making agent workflows survive restarts.
Surviving restarts
Long-running agent workflows have a problem nobody talks about until the first time it bites: processes die. The orchestrator crashes mid-task. The container gets reaped. The user closes the browser. Your agent had completed nine of ten steps, and then it ate dirt and lost everything.
Checkpointing fixes this. After every interesting transition, you persist enough state to resume. When the system restarts, it reloads the last checkpoint and picks up where it left off. The state machine from the previous lesson is what makes this clean: the state cursor is the resumption point.
What to checkpoint
A reasonable checkpoint contains:
- The current state (PLANNING, EXECUTING, etc.).
- The plan (steps remaining, current cursor).
- Committed results so far (whatever the orchestrator has accepted).
- The original user request and any session-level context.
- A monotonically increasing version number so you can detect partial writes.
Things you do not checkpoint:
- The inner loop's working memory mid-tool-call. If a worker crashes mid-call, restart the worker from the start of its current handoff.
- Tool-call traces beyond what the supervisor needs.
- Model parameters, API keys, anything that should come from configuration.
The checkpoint should be small enough to write atomically and complete enough to resume cleanly. A few KB per checkpoint is normal.
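For concreteness, a serialized checkpoint might look like this; the field values are illustrative, but the fields mirror the list above:

{
  "version": 1,
  "state": "REFLECTING",
  "request": "Summarize last week's incident reports",
  "plan": ["fetch reports", "extract themes", "draft summary"],
  "cursor": 1,
  "results": [{"step": 0, "output": "12 reports fetched"}]
}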
When to checkpoint
The natural moments are state-machine transitions. Specifically:
- After PLANNING completes (you have a fresh plan).
- After OBSERVING records a worker's result (you have a new finding).
- After REFLECTING decides the next move (you know what's coming).
Do not checkpoint during EXECUTING (a worker's middle-of-loop state is not yours to save) or during the inner-loop tool calls.
def drive_with_checkpoint(loop, store):
    while loop.state not in (State.COMPLETE, State.FAILED):
        step_once(loop)               # one transition's worth of work
        store.save(loop.snapshot())   # checkpoint after every transition
    store.save(loop.snapshot())       # final checkpoint

The checkpoint after every transition is overkill for some systems and exactly right for others. The threshold is roughly: how much work do you tolerate losing in a crash? For a 30-second agent run, checkpointing every step is unnecessary. For a 30-minute run, it is mandatory.
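If per-transition saves are too frequent for your workload, a time-based throttle is a simple middle ground. A minimal sketch, reusing the loop and store from above; min_interval_s is an invented knob:

import time

def drive_throttled(loop, store, min_interval_s=5.0):
    last_save = 0.0
    while loop.state not in (State.COMPLETE, State.FAILED):
        step_once(loop)
        if time.monotonic() - last_save >= min_interval_s:
            store.save(loop.snapshot())   # at most one save per interval
            last_save = time.monotonic()
    store.save(loop.snapshot())           # always persist the terminal state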
Atomic writes
A half-written checkpoint is worse than no checkpoint. The standard fix is write-then-rename:
import json
import os

def save_atomic(snapshot, path):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(json.dumps(snapshot))
        f.flush()
        os.fsync(f.fileno())   # force the bytes to disk before the rename
    os.replace(tmp, path)      # atomic on POSIX

This guarantees that any reader either sees the old checkpoint or the new one, never a half-written one. For databases, use a transaction with a single update statement. For object stores, use the storage API's atomic upload semantics.
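For the database case, here is a sketch using the standard-library sqlite3 module; the table name and schema are assumptions, and the upsert syntax needs SQLite 3.24+:

import json
import sqlite3

# Assumed schema: CREATE TABLE checkpoints (run_id TEXT PRIMARY KEY, body TEXT)
def save_checkpoint_db(conn: sqlite3.Connection, run_id: str, snapshot: dict):
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT INTO checkpoints (run_id, body) VALUES (?, ?) "
            "ON CONFLICT (run_id) DO UPDATE SET body = excluded.body",
            (run_id, json.dumps(snapshot)),
        )

A reader sees the old row or the new row, never a partial write.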
Resuming
On startup:
- Load the checkpoint.
- Validate the schema version. If incompatible, error out (do not silently misread old state).
- Reconstruct the loop's Loop object.
- Drive it from the loaded state.
def resume(store):
    snapshot = store.load()
    if snapshot is None:
        return None   # nothing to resume; start fresh
    if snapshot["version"] != CHECKPOINT_VERSION:
        raise IncompatibleCheckpoint(snapshot["version"])
    loop = Loop.from_dict(snapshot)
    drive_with_checkpoint(loop, store)
    return loop

The state cursor in the loaded snapshot tells the driver exactly which handler to run next. This is why the state machine matters: without it, "where do we resume from?" has no clean answer.
Idempotency for the inner loop
Resuming from a checkpoint means a worker that was mid-execution will be re-run. If the worker has side effects (sending email, calling external APIs, writing to a database), re-running it can cause duplicates: two emails, two API calls, two writes.
Two strategies:
Idempotency keys
Each side-effecting tool call carries a deterministic key. The receiving system checks if it has seen the key and no-ops on duplicates. The same key in two different worker runs produces the same outcome (one effect, not two).
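A sketch of the key side, assuming the call is addressed by run, step, tool, and arguments; claim() and deliver() are hypothetical stand-ins for an atomic set-if-absent store and the actual send:

import hashlib
import json

def idempotency_key(run_id: str, step: int, tool: str, args: dict) -> str:
    # Deterministic: the same logical call yields the same key across
    # worker restarts, so a re-run collides with the first attempt.
    payload = json.dumps(
        {"run": run_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def send_email_once(store, key: str, message: dict):
    # Receiving side: first claim wins, duplicates no-op.
    if not store.claim(key):   # hypothetical atomic set-if-absent (e.g. Redis SETNX)
        return "duplicate"
    return deliver(message)    # hypothetical actual send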
Plan-level idempotency
Build the plan such that every step is naturally repeatable. "Read file X" is fine; you can do it twice. "Send the user an email" is not; you cannot. For non-idempotent actions, push them to the very end of the plan, after all decisions are committed, so a crash before them just means re-deciding (cheap), not re-acting (expensive).
For most agent workflows, a small set of side-effecting tools deserves the idempotency-key treatment: any tool that posts, sends, deploys, charges, or modifies external state. The rest can be re-run safely.
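One way to encode both ideas in code; the tool names in SIDE_EFFECTING are illustrative:

SIDE_EFFECTING = {"send_email", "deploy", "charge_card"}  # illustrative tool names

def order_plan(steps):
    # Stable partition: repeatable steps first, side-effecting steps last,
    # preserving relative order within each group.
    safe = [s for s in steps if s["tool"] not in SIDE_EFFECTING]
    risky = [s for s in steps if s["tool"] in SIDE_EFFECTING]
    return safe + risky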
Storage choices
| Storage | Pros | Cons | Right for |
|---|---|---|---|
| Local file | Fast, simple | Lost if container dies | Dev, single-host |
| Redis | Fast, supports TTL | Memory cost; need persistence config | Most production |
| Postgres | Durable, queryable, transactional | Heavier API | Long-lived workflows, audit needs |
| Object store (S3) | Cheap, durable | Higher latency | Long-running batch agents |
Redis is the typical default for online agent workloads. Postgres if you want to query checkpoints (e.g., admin dashboard showing all in-progress agents). Object store for batch jobs where checkpoints are only read on restart.
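A minimal Redis-backed store, sketched with redis-py; the key scheme and default TTL are assumptions, not a prescribed layout:

import json
import redis

class RedisCheckpointStore:
    def __init__(self, client: redis.Redis, run_id: str, ttl_s: int = 24 * 3600):
        self.client = client
        self.key = f"checkpoint:{run_id}"   # key scheme is an assumption
        self.ttl_s = ttl_s

    def save(self, snapshot: dict):
        # A single SET is atomic in Redis; EX refreshes the TTL on every write.
        self.client.set(self.key, json.dumps(snapshot), ex=self.ttl_s)

    def load(self):
        raw = self.client.get(self.key)
        return json.loads(raw) if raw is not None else None

    def delete(self):
        self.client.delete(self.key)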
TTLs and cleanup
Checkpoints accumulate. Without a TTL or cleanup, your store fills up with finished and abandoned runs. Two policies:
- Delete on COMPLETE / FAILED. As soon as a loop terminates, drop its checkpoint.
- TTL after last write. Any checkpoint not touched for N hours gets auto-deleted.
The second is more forgiving (a temporarily-stalled run can resume), but the first is cleaner if you have it. A combination is fine: delete on terminal state plus a 24-hour TTL as a safety net.
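Wired into the driver, the combined policy is a few lines (using the store sketched above):

def drive_with_cleanup(loop, store):
    while loop.state not in (State.COMPLETE, State.FAILED):
        step_once(loop)
        store.save(loop.snapshot())   # each write refreshes the TTL safety net
    store.delete()                    # terminal state: drop the checkpoint now
    # If the process dies before delete() runs, the TTL reaps the orphan.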
Checkpointing is not transactional with the world
A checkpoint records what your orchestrator believes happened, not what actually happened. If your worker sent an email and crashed before the supervisor got the result, the checkpoint will not show the email-send. On resume you may send a second email. Idempotency keys are the only real fix; checkpointing alone cannot give you exactly-once semantics for external side effects.
A minimal checkpoint protocol
from dataclasses import asdict, dataclass

@dataclass
class Snapshot:
    version: int
    request: str
    state: str
    plan: list
    cursor: int
    results: list

def checkpoint(loop, store):
    snap = Snapshot(
        version=1,
        request=loop.request,
        state=loop.state.name,
        plan=loop.plan,
        cursor=loop.cursor,
        results=loop.results,
    )
    store.save_atomic(asdict(snap))   # serialize to a plain dict before writing

That is enough to resume any of the loops we have written so far. The next module covers safety and control: how to make sure a resumed loop does not, for instance, charge a customer twice because the agent decided "well, last time we saw the cart it was unpaid."
Key takeaway
Checkpointing turns a fragile long-running loop into a resumable one. Save the state cursor, the plan, and committed results after every transition. Use atomic writes. Make tools idempotent for any side-effecting action. The state machine from the previous lesson is what makes resumption clean: with it, "where do we resume from" is well-defined; without it, you are guessing. The next module is about controlling what those resumed (or fresh) loops are allowed to do.