Metacognition
Confidence estimation and knowing when to ask for help
When the agent should escalate vs proceed.
Knowing when to stop, escalate, or ask
The previous two lessons assumed the agent has some sense of how things are going: reflection finds problems, adaptation notices when a strategy is stalling. Both rely on implicit signals (rounds without progress, repeated critiques). Confidence estimation makes the signal explicit: the agent maintains a belief about whether its current path is going to work, and uses that belief to decide whether to continue, ask for help, or give up.
This is the deepest metacognitive layer in this track. It is also the hardest to get right, because LLM self-reported confidence is notoriously poorly calibrated. The trick is to use confidence as a signal, not a final answer, and to combine it with cheaper structural signals.
Two kinds of confidence
Self-reported confidence
The agent estimates how likely its current answer is to be correct, expressed as a number or category.
def self_score(answer, question, evidence):
    # Ask the model to rate its own answer; returns raw text, not a calibrated number
    return model.run(
        system="Score 0 to 100 how confident you are this answer is correct.",
        user=f"Q: {question}\nEvidence: {evidence}\nAnswer: {answer}",
    )

This is easy to ask for and unreliable in practice. Models tend to be overconfident, especially on tasks they got wrong. Treat self-reports as one signal among many, not as a calibrated probability.
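Because model.run returns free text, the number still has to be extracted and normalized before it can be blended with anything else. A small sketch of that step; the parse_self_score helper, its regex, and the 0.5 fallback are assumptions for illustration, not part of any particular API:

import re

def parse_self_score(raw: str) -> float:
    """Convert the model's '0 to 100' reply into a float in [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return 0.5  # unparseable reply: treat as "no information" rather than crash
    return max(0.0, min(1.0, float(match.group()) / 100.0))  # clamp stray values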
Structural confidence
Indicators that don't require the model to introspect, derived from the loop's behavior:
- Tool-call coverage. Did the agent actually call the tools that should have been called for this kind of question? Low coverage = low confidence.
- Evidence corroboration. Are multiple sources saying the same thing, or just one?
- Reasoning length. Tasks that converge quickly are more often right; tasks that took 20 turns of bouncing around are more often wrong.
- Critique outcomes. If reflection finds no problems on the first pass, that's a signal. If it finds problems repeatedly, that's the opposite.
Structural confidence is more reliable than self-reports because it is based on observable facts about the loop, not on the model rating itself.
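The scoring function in the next section reads these signals off a per-task trace. A minimal sketch of what that trace might look like, assuming the loop records its own bookkeeping as it runs; the LoopStats class and its field names are illustrative, not a prescribed schema:

from dataclasses import dataclass, field

@dataclass
class LoopStats:
    """Hypothetical per-task trace; field names mirror the scoring function below."""
    tools_called: int = 0
    evidence_sources: set = field(default_factory=set)
    critique_passed_first_time: bool = False
    rounds_taken: int = 0
    last_strategy_switched_recently: bool = False

    @property
    def distinct_evidence_sources(self) -> int:
        return len(self.evidence_sources)

    def record_tool_call(self, source_id: str) -> None:
        # Structural confidence is bookkeeping, not introspection:
        # update the trace as the loop runs, score it afterwards.
        self.tools_called += 1
        self.evidence_sources.add(source_id)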
Combining the signals
A reasonable scoring function:
def confidence(loop):
    score = 0.5  # neutral starting point
    if loop.tools_called >= EXPECTED_TOOL_COUNT_FOR_TASK:
        score += 0.15
    if loop.distinct_evidence_sources >= 2:
        score += 0.15
    if loop.critique_passed_first_time:
        score += 0.10
    if loop.rounds_taken < EXPECTED_ROUNDS_FOR_TASK:
        score += 0.10
    if loop.rounds_taken > 2 * EXPECTED_ROUNDS_FOR_TASK:
        score -= 0.20
    if loop.last_strategy_switched_recently:
        score -= 0.15
    self_score = ask_model_for_confidence(loop)  # normalized to [0, 1]
    score = 0.7 * score + 0.3 * self_score       # blend: structure dominates
    return max(0.0, min(1.0, score))

The structural pieces dominate (70% weight). The model's self-report contributes (30%) but doesn't drive the decision. Numbers are illustrative; calibrate against your eval set.
What confidence is for
Once you have a confidence number, you can act on thresholds:
THRESHOLDS = {
    "commit": 0.75,    # confident enough to return to user
    "iterate": 0.40,   # not confident; reflect or adapt
    "escalate": 0.20,  # very low; ask user or hand to a stronger model
}

def decide(loop):
    c = confidence(loop)
    if c >= THRESHOLDS["commit"]:
        return "commit"
    elif c >= THRESHOLDS["escalate"]:
        return "iterate"
    else:
        return "escalate"

The thresholds turn confidence into routing decisions. Below 0.20: do not return this answer to the user; escalate to a human or a stronger model. Between 0.20 and 0.75: try one more round of reflection or strategy adaptation. Above 0.75: commit and return.
The exact thresholds depend on your task. Calibrate them by labeling a sample of past sessions and seeing which threshold gives the best precision/recall on "should this have been escalated."
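A minimal sketch of that calibration, assuming each labeled session is a (confidence_score, should_have_escalated) pair; the candidate grid and the F1 criterion are illustrative choices:

def sweep_escalation_threshold(sessions, candidates=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Pick the escalation cutoff with the best F1 on labeled past sessions.

    sessions: list of (confidence_score, should_have_escalated) pairs
    from manually labeled historical runs.
    """
    best_f1, best_t = -1.0, candidates[0]
    for t in candidates:
        tp = sum(1 for c, label in sessions if c < t and label)
        fp = sum(1 for c, label in sessions if c < t and not label)
        fn = sum(1 for c, label in sessions if c >= t and label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t  # the cutoff that best separates "should have escalated" sessions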
Common failure modes
Overconfidence on tasks the model can't solve
The classic LLM failure: the model is confident on a question outside its capability and gives a wrong answer with no hesitation. Self-reports won't catch this; structural signals partially do (e.g., low evidence corroboration). The strongest fix is calibration data: track historical accuracy by task type and adjust confidence priors.
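One way to wire that calibration data in, assuming you track historical accuracy per task type; the HISTORICAL_ACCURACY table, its numbers, and the blend weight are illustrative:

# Hypothetical historical accuracy by task type, measured on your eval set.
HISTORICAL_ACCURACY = {
    "lookup": 0.92,      # simple retrieval: usually right
    "multi_hop": 0.61,   # chained reasoning: often wrong
    "forecast": 0.35,    # largely outside the model's capability
}

def with_prior(score: float, task_type: str, weight: float = 0.3) -> float:
    """Pull a confidence score toward the historical base rate for its task type."""
    prior = HISTORICAL_ACCURACY.get(task_type, 0.5)  # unknown task types: neutral prior
    return (1 - weight) * score + weight * prior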
Underconfidence on easy tasks
Models sometimes hedge on questions they could answer correctly. This wastes tool calls (the agent reflects when it didn't need to) and produces verbose answers ("I think... but I'm not sure..."). Mitigate with a minimum threshold for asking the user; do not escalate trivial questions.
Confidence that doesn't generalize
A confidence model tuned on one task type often miscalibrates on another. Consider per-task-type confidence functions if your agent serves wildly different workloads.
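A sketch of that split, assuming each loop carries a task_type field; the registry and its entries are placeholders:

# Hypothetical per-workload registry: each entry is a scoring function
# calibrated on that task type's own labeled sessions.
CONFIDENCE_FNS = {
    "research": confidence,  # the generic structural + self-report blend above
    # "code_review": code_review_confidence,  # e.g. weights test outcomes heavily
}

def confidence_for(loop):
    # Fall back to the generic scorer for task types without their own function.
    return CONFIDENCE_FNS.get(loop.task_type, confidence)(loop)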
Confidence and human-in-the-loop
Confidence estimation is the natural input to automatic approval gating. Instead of a hard rule like "always require approval for deploy," you can use "require approval if confidence is below 0.6 on deploy." A deploy the agent is confident about runs straight through; one it is uncertain about gets a human check.
This is a softer pattern than the gates from Module 5 lesson 3 and isn't always right (some actions deserve unconditional approval). When it fits, though, it produces the right number of approval prompts: many for risky uncertain actions, none for confident routine ones.
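A sketch of such a gate, assuming a request_approval helper and action objects with a name and an execute method; those names, and the floor values, are placeholders rather than a real API:

# Per-action confidence floors; actions not listed here run without a gate.
APPROVAL_FLOORS = {
    "deploy": 0.60,
    "delete_data": 0.80,  # riskier action, higher bar (illustrative numbers)
}

def run_with_soft_gate(action, loop):
    floor = APPROVAL_FLOORS.get(action.name)
    if floor is not None and confidence(loop) < floor:
        # Uncertain and risky: pause for a human before executing.
        if not request_approval(action, reason=f"confidence below {floor}"):
            return "rejected"
    return action.execute()  # confident, or un-gated action: run straight through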
Confidence as the input to escalation
The clearest use of confidence is escalation: when confidence is too low, hand off to a stronger agent, a different strategy, or a human. The cleanest way to encode this:
def maybe_escalate(loop):
    c = confidence(loop)  # score once; record the same value that drove the decision
    if c < THRESHOLDS["escalate"]:
        return Handoff(
            from_agent=loop.agent,
            to_agent="human" if loop.escalation_target == "human" else "expert-agent",
            intent="Original task could not be confidently answered",
            context=summarize(loop.history),
            artifacts={"final_confidence": c},
        )
    return None

This is the metacognitive escape hatch: when the agent admits it doesn't know, the system has a path that doesn't involve guessing.
Confidence is not certainty
Self-reported confidence numbers are signals, not probabilities. Do not treat "the model says it's 90% sure" as the same as "this answer is correct 90% of the time." Calibrate against your eval data; use the model's self-report as a tiebreaker, not as truth.
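One simple diagnostic for that calibration is a reliability table: bucket eval examples by the model's reported confidence and compare against measured accuracy. A sketch, assuming each example is a (self_reported_confidence, was_correct) pair; the bucket width is an arbitrary choice:

from collections import defaultdict

def reliability_table(examples, bucket_width=0.1):
    """examples: (self_reported_confidence, was_correct) pairs from your eval set."""
    buckets = defaultdict(lambda: [0, 0])  # bucket index -> [correct, total]
    n_buckets = int(round(1 / bucket_width))
    for conf, correct in examples:
        b = min(int(conf / bucket_width), n_buckets - 1)  # put conf == 1.0 in the top bucket
        buckets[b][1] += 1
        if correct:
            buckets[b][0] += 1
    # A well-calibrated scorer's 0.8-0.9 bucket should be right roughly 80-90% of the time.
    return {
        (round(b * bucket_width, 2), round((b + 1) * bucket_width, 2)): correct / total
        for b, (correct, total) in sorted(buckets.items())
    }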
Wrap-up: Track 2
This module closes Track 2. Across six modules you have moved from single agents (Module 1) to multi-agent communication (Module 2), orchestration topologies (Module 3), state and resumability (Module 4), safety and control (Module 5), and now metacognition. Each module built on the one before:
- Communication patterns assume you've split into multiple agents.
- Topologies assume you've decided how they communicate.
- State management assumes you have a topology to manage.
- Safety controls assume you have state to protect.
- Metacognition assumes you have a loop to reflect on.
Track 3 shifts gears entirely. It is about Model Context Protocol (MCP), the standard for connecting agents to external tools. Many of the patterns in this track (handoffs, scopes, observability) carry over; the implementation moves from "your tool registry" to "third-party servers your agents can connect to."
Key takeaway
Confidence estimation makes the agent's "how am I doing?" judgment explicit. Combine structural signals (tool coverage, evidence corroboration, rounds taken) with self-reports, weighted toward the structural ones because LLM self-reports are poorly calibrated. Use thresholds to route: commit, iterate, or escalate. Confidence is the natural input to soft approval gating and to escalation paths. With this, you have the full metacognitive layer that lets agents decide when to keep going and when to stop.