Deployment and scaling
Human-in-the-loop patterns
When and how to involve humans in agent workflows.
When the agent isn't enough
Track 2 Module 5 lesson 3 introduced approval gates: the agent pauses on high-risk actions and waits for a human to confirm. That's one shape of human-in-the-loop. There are others, and in production agent systems most of them coexist. This lesson covers the full set: when each shape is right, how to wire them up, and how to keep human-in-the-loop from becoming the bottleneck of your system.
Five shapes of human-in-the-loop
1. Approval (gating a specific action)
The agent proposes a tool call; a human approves or denies. Covered in Track 2 Module 5 lesson 3. Best for irreversible or expensive actions.
2. Disambiguation (asking before deciding)
The agent doesn't know what the user wants. It asks. The human answers. Best for ambiguous requests where guessing wrong is more costly than asking. Track 2 Module 6 lesson 3 (confidence estimation) is the natural trigger.
3. Verification (signing off on a result)
The agent produces a draft; a human reviews before the draft is committed. Best for outputs that go to external audiences (customers, public posts) where mistakes are visible.
4. Escalation (handing off completely)
The agent gives up on a task and routes it to a human. Best for cases where the agent isn't the right tool: complex emotional situations, novel problem types, anything outside its training distribution.
5. Correction (overriding an output)
The agent's answer is sent to the user; the user (or a downstream operator) marks it as wrong; the correction feeds back into evals or fine-tuning. Best for steady-state quality improvement.
These compose. A single agent can have approval on destructive tools, disambiguation when uncertain, verification on customer-facing drafts, escalation on out-of-scope queries, and correction logging on every output.
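One way to make that composition concrete is a per-agent policy object. A minimal sketch, with all names and thresholds illustrative rather than drawn from any particular framework:

from dataclasses import dataclass, field

@dataclass
class HITLPolicy:
    # Approval: gate these tool calls behind a human decision.
    approval_tools: set[str] = field(default_factory=set)
    # Disambiguation: ask the user when confidence drops below this.
    disambiguate_below: float = 0.6
    # Verification: outputs on these channels get human review before send.
    verify_channels: set[str] = field(default_factory=set)
    # Escalation: topics the agent hands off entirely.
    escalate_topics: set[str] = field(default_factory=set)
    # Correction: log every output so downstream overrides can feed back.
    log_corrections: bool = True

policy = HITLPolicy(
    approval_tools={"refund", "delete_record"},
    verify_channels={"email", "public_post"},
    escalate_topics={"legal", "account_security"},
)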
Designing the human's interface
The single biggest determinant of whether human-in-the-loop works is the UX of the human's surface. Three principles:
Show enough, not everything
A human reviewing an approval needs the proposed action, the reason, the relevant context, and the risk class. They don't need the full conversation transcript. Compress.
Surface decisions clearly
Buttons and structured choices, not free-text. "Approve / Deny / Modify" with an optional comment field beats "respond however you'd like."
Time-box decisions
Show how long the agent has been waiting. After a TTL, the request expires (or auto-denies, or auto-escalates further). Stale approvals are worse than no approvals.
A reasonable approval UI surface (Slack message, web UI, whatever):
[REVIEW REQUEST]
Agent: customer-support-agent
Action: refund(order_id="O-9421", amount_usd=350)
Reason: customer reported defective product, attached photos confirm
Plan step: 3 of 4
Context: customer has 6 prior orders, no prior refunds
Risk: financial (irreversible)
Expires in: 30 minutes
[Approve] [Deny + reason] [Escalate to manager]
A human can decide in 5-10 seconds with this. A wall of transcript text makes them less likely to look closely; a too-terse summary makes them ask the agent to re-prepare the request.
Async vs sync approvals
Same split as Track 2 Module 5 lesson 3:
- Sync: agent blocks until response. Fine for short waits and dev environments.
- Async: agent persists state, exits, and resumes when a callback fires (Track 2 Module 4 lesson 3 on checkpointing covers the resume mechanics).
Production: always async. Anything that blocks a worker thread for a human's response is a wasted resource.
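A minimal sketch of the async shape, using an in-memory dict as the checkpoint store (a real system would use a database) and hypothetical notify_reviewer/resume_agent hooks:

import uuid

CHECKPOINTS: dict[str, dict] = {}  # stand-in for a durable checkpoint store

def notify_reviewer(request_id: str, action: dict) -> None:
    print(f"review needed: {request_id} -> {action}")  # stand-in for Slack/queue/email

def resume_agent(state: dict, approved: bool) -> None:
    print(f"resuming at step {state.get('step')}, approved={approved}")  # stand-in

def request_approval(state: dict, action: dict) -> str:
    """Persist the run and return immediately; no worker blocks on the human."""
    request_id = str(uuid.uuid4())
    CHECKPOINTS[request_id] = {"state": state, "action": action}
    notify_reviewer(request_id, action)
    return request_id

def on_approval_callback(request_id: str, decision: str) -> None:
    """Invoked by the webhook or button handler when the human responds."""
    checkpoint = CHECKPOINTS.pop(request_id)
    resume_agent(checkpoint["state"], approved=(decision == "approve"))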
Where humans plug in
Operationally, two patterns:
In-app
The user themselves is the human in the loop. They see the approval/disambiguation prompt; they respond; the agent continues. Simplest; the user is always available because they're driving.
Out-of-band
A different human (a moderator, an ops person, an admin) handles the human-in-the-loop interactions. The user is unaware until the agent completes. Used when:
- The action affects more than just one user.
- The user lacks expertise to make the call.
- Compliance requires sign-off from someone who isn't the requester.
For consumer products, in-app is most common. For B2B and ops tooling, out-of-band shows up frequently.
Avoiding HITL bottlenecks
Human-in-the-loop is helpful only if humans can keep up. Common failure modes:
Approval queue grows unbounded
If 1000 actions per hour trigger approvals and only one human is staffed, the queue runs away. Fix: scope approvals tighter (Track 2 Module 5 lesson 3), staff up, or auto-approve actions that fall below a risk threshold.
Approval fatigue
Humans rubber-stamp because most actions are fine. Approvals lose meaning. Fix: gate fewer things. Surface only the genuinely uncertain or risky. Track approval/denial rates and adjust.
Out-of-hours degradation
Approvals don't happen at 3am. The whole system stalls overnight. Fix: explicit SLOs for approval response, fallback policies (auto-deny, escalate to oncall, queue until morning), and monitoring on queue depth.
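As a sketch, the TTL sweep and fallback policy might look like this; the pending-queue shape, the thresholds, and the stub functions are all illustrative:

import time

PENDING: dict[str, dict] = {}  # request_id -> {"created_at": ..., "risk": ...}
TTL_SECONDS = 30 * 60

def escalate_to_oncall(request_id: str) -> None:
    print(f"paging oncall for {request_id}")       # stand-in for a real pager

def auto_deny(request_id: str) -> None:
    print(f"auto-denied {request_id} after TTL")   # agent resumes and plans around it

def sweep_expired(now: float | None = None) -> None:
    """Run periodically; applies the fallback policy to stale requests."""
    now = now or time.time()
    for request_id, req in list(PENDING.items()):
        if now - req["created_at"] < TTL_SECONDS:
            continue
        PENDING.pop(request_id)
        if req["risk"] == "high":
            escalate_to_oncall(request_id)         # high-risk never auto-resolves
        else:
            auto_deny(request_id)
    print(f"hitl_queue_depth={len(PENDING)}")      # feed this into metrics/alerts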
HITL as a learning loop
Human responses are training data:
log.info("hitl_decision", extra={
"request_id": req_id,
"user_id": uid,
"action_type": "approval",
"agent_proposed": action,
"agent_reason": reason,
"human_decision": decision,
"human_feedback": feedback,
"decision_latency_ms": elapsed,
})Each decision tells you something:
- Approval rate over 99%: the gate is probably unnecessary.
- Denial rate above 10%: the agent is proposing too aggressively; tighten upstream.
- Specific consistent denials: a class of agent decisions is wrong; investigate.
- Long decision latency: the surface isn't usable; redesign.
This data feeds back into your eval set (the cases where humans disagreed with the agent are evidence of failures the agent is making) and into prompts/policies (clear corrections become rules).
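As an illustration, a periodic report over the logged decisions can flag gates against those thresholds. Field names follow the log sketch above, and this assumes agent_proposed carries a tool name; the cutoffs are the ones quoted in the list:

from collections import defaultdict

def gate_report(decisions: list[dict]) -> dict[str, dict]:
    """Group logged HITL decisions by proposed tool and flag outliers."""
    by_tool: dict[str, list[dict]] = defaultdict(list)
    for d in decisions:
        by_tool[d["agent_proposed"]["tool"]].append(d)
    report = {}
    for tool, ds in by_tool.items():
        approved = sum(1 for d in ds if d["human_decision"] == "approve")
        rate = approved / len(ds)
        flag = None
        if rate > 0.99:
            flag = "gate probably unnecessary"
        elif rate < 0.90:
            flag = "agent proposing too aggressively"
        report[tool] = {"approval_rate": round(rate, 3), "flag": flag}
    return report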
Disambiguation specifically
Disambiguation is asking the user before deciding. The trick is asking only when it actually helps:
- Don't ask when the right answer is obvious.
- Don't ask when the user provided enough context already.
- Don't ask when both options would be acceptable.
async def respond(question, context):
    # estimate_confidence, ask_user, and answer are helpers assumed elsewhere;
    # THRESHOLD holds calibrated cutoffs per decision type.
    confidence = await estimate_confidence(question, context)
    if confidence < THRESHOLD["disambiguate"]:
        options = top_k_interpretations(question, context)  # candidate readings to offer
        return await ask_user(question, options=options)
    return await answer(question, context)
A frequent failure mode: agents that ask "to confirm, did you mean X?" on every request. This is annoying when X is the only reasonable interpretation. Calibrate confidence; only ask when there is genuine ambiguity.
Verification and review queues
For agent outputs that ship to customers (sent emails, published content, automated reports), a verification step before the output goes out is often the right call.
Two patterns:
Sample-based
A random fraction of outputs is reviewed; the rest go through automatically. Catches systemic issues; misses individual mistakes. Fine for low-stakes, high-volume content.
Risk-based
Outputs above a risk threshold (specific topics, specific recipients, low confidence) are reviewed; the rest go through. Targeted; protects the cases that matter most.
For most production agent systems doing automated content, risk-based review is the right pattern. The agent's confidence and self-classification of risk drive who reviews what.
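A sketch of that routing, assuming the agent attaches a confidence score and a self-assessed risk label to each output; the thresholds, labels, and sample rate are illustrative:

import random

def route_output(output: dict) -> str:
    """Decide which queue an output enters before it ships."""
    if output["risk"] in {"legal", "financial"} or output["confidence"] < 0.7:
        return "human_review"      # risk-based: protect the cases that matter most
    if random.random() < 0.02:
        return "sample_review"     # small random sample still catches systemic drift
    return "auto_publish"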
Wrapping up Track 4 and the curriculum
This module closes Track 4 and the entire course:
- Module 1 (reliability): failure taxonomy, retries, validation, runtime policy.
- Module 2 (evaluation): testing nondeterministic systems, eval datasets, eval pipelines.
- Module 3 (observability): logs, traces, metrics, alerts.
- Module 4 (deployment): containers, queues, cost, human-in-the-loop.
Across four tracks you've now covered the full vertical: from agent fundamentals (Track 1) through orchestration patterns (Track 2), MCP (Track 3), and production (Track 4). Each track depends on the ones before it: production patterns assume orchestration; orchestration assumes a working single-agent loop. The dependency also pays off in reverse: every fundamental from Track 1 can be deepened with the production-grade patterns from Track 4.
You now have a coherent picture of what it takes to build and run agents at scale: the loop, the topologies, the protocols, the safety, the reliability, the evaluation, and the operational story. The frontier from here is mostly about applying these patterns to specific domains, your own product, your own users.
Build the simplest version first; then layer
The best lesson of this whole curriculum: every pattern you've learned exists to solve a specific failure mode. None of them is mandatory until that failure mode shows up. Start with a monolithic agent. Watch where it breaks. Apply the next-up pattern. Repeat. Most production agent systems end up using maybe 30% of the patterns in this curriculum heavily and the rest occasionally. Knowing the catalog is what lets you reach for the right one when you need it; build first, evolve into complexity.
Key takeaway
Human-in-the-loop has five shapes: approval, disambiguation, verification, escalation, correction. Each fits a different failure or risk class. The UX of the human's surface determines whether it works in practice; show enough but not everything, surface clear decisions, time-box. Run async with checkpointing for long delays. Track human decisions as training data; use them to refine the agent. With this, you've completed the curriculum: from the agent loop to a production system that ships, scales, and stays trustworthy.