Evaluation
Testing non-deterministic systems
The unique challenge of agent evaluation.
You can't unit-test an LLM
Traditional software testing rests on a guarantee: same input, same output. Unit tests, integration tests, snapshot tests, and golden tests all assume the system under test is deterministic. LLMs aren't, and neither are agents that wrap them.
Two LLM calls with the same input might produce slightly different outputs (different word choices, different reasoning paths). The same agent run twice on the same task might call tools in different orders or stop at different points. Tests that check for exact string equality fail randomly. Tests that just spot-check by hand miss regressions.
The agent-eval problem is: how do you know your system is getting better, not worse, when its outputs are stochastic? This module is the answer. The pieces:
- Lesson 1 (this one): how testing has to change for non-deterministic systems.
- Lesson 2: building the eval dataset (the inputs you grade against).
- Lesson 3: running the eval pipeline (judges, automation, dashboards).
What "passing an eval" means
For deterministic code, a test passes when the output equals the expected value. For agents, a test "passes" when the output meets a criterion. The criterion can be:
- Structural. Does the output have the right shape? (JSON schema, length, format.)
- Substantive. Does the answer say the right things? (Does it mention the bug? Does it cite a source? Does it not hallucinate facts?)
- Behavioral. Did the agent take a reasonable path? (Did it call the right tools? Did it stay within budget? Did it escalate when it should have?)
Multiple criteria per test case is normal. A "good" output passes all of them; a borderline output might pass some.
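To make that concrete, here is a minimal sketch of a case with one structural and one substantive criterion. The case shape, section names, and test name are invented for illustration:

```python
# Hypothetical case: each criterion is an independent pass/fail check.
def has_required_sections(output: str) -> bool:
    # Structural: the report must contain both sections.
    return "## Diagnosis" in output and "## Fix" in output

def mentions_failing_test(output: str) -> bool:
    # Substantive: the answer must name the failing test.
    return "test_pagination" in output

CRITERIA = [has_required_sections, mentions_failing_test]

def grade(output: str) -> dict:
    # A per-criterion report, not a single boolean.
    return {c.__name__: c(output) for c in CRITERIA}
```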
Three kinds of evals
Reference-based
You have a known-good answer. The eval compares the agent's output to the reference using either exact match (for structured outputs) or a similarity metric (for free-text). Easy to automate, brittle to phrasing differences.
```python
def reference_eval(actual: str, expected: str) -> bool:
    # Exact match after trimming whitespace; any rephrasing fails.
    return actual.strip() == expected.strip()
```

Good for: structured outputs, function-call shapes, classification.
Bad for: anything where the answer can be phrased in many valid ways.
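For free text, a similarity threshold is a common middle ground between exact match and a full rubric. Here is a sketch using Python's standard-library `difflib`; the 0.8 threshold is an arbitrary starting point you would tune:

```python
from difflib import SequenceMatcher

def similarity_eval(actual: str, expected: str, threshold: float = 0.8) -> bool:
    # Ratio of matching characters; crude, but tolerates near-identical phrasings.
    ratio = SequenceMatcher(None, actual.strip(), expected.strip()).ratio()
    return ratio >= threshold
```

Embedding-based similarity is the usual upgrade when character-level overlap is too crude.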
Rubric-based
You have a rubric: a list of criteria the answer must meet. The eval scores the answer against each criterion, often using an LLM as the grader. More flexible than reference-based, more expensive to run.
```python
RUBRIC = [
    "Mentions the specific failing test case",
    "Identifies the file and line of the bug",
    "Explains why the bug causes the symptom",
    "Does not hallucinate code that doesn't exist",
]

def rubric_eval(actual, rubric):
    # One verdict per criterion, typically from an LLM judge (sketch below).
    return [judge(actual, criterion) for criterion in rubric]
```

Good for: open-ended outputs, free-text answers, plans.
Bad for: anything where the rubric itself is hard to write down.
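The `judge` call above is where the real work happens. Here is one way it might look, assuming the OpenAI Python SDK; the model name and prompt wording are placeholder choices, and any capable LLM API works the same way:

```python
from openai import OpenAI

client = OpenAI()

def judge(actual: str, criterion: str) -> bool:
    # Ask a grader model a yes/no question about one criterion.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable grader model works
        messages=[
            {"role": "system", "content": "Answer YES or NO only."},
            {"role": "user", "content": (
                f"Criterion: {criterion}\n\n"
                f"Answer to grade:\n{actual}\n\n"
                "Does the answer satisfy the criterion?"
            )},
        ],
        temperature=0,  # grading should be as deterministic as possible
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```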
Behavioral
You don't grade the final output; you grade the process. Did the agent stay under N tool calls? Did it use the cheapest tool that would work? Did it escalate to a human when confidence was low?
```python
def behavioral_eval(trace) -> dict:
    # Grade the trajectory, not the answer: each check is a process property.
    return {
        "tool_calls": len(trace.tool_calls) <= 10,
        "used_required_tool": "search" in [t.name for t in trace.tool_calls],
        "stayed_in_budget": trace.tokens <= 5000,
    }
```

Good for: cost, latency, reliability properties.
Bad for: anything you care about in the answer itself.
A good eval suite mixes all three. Reference-based for the parts that have right answers; rubric-based for the parts that don't; behavioral for the operational properties.
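In code, the mix can be as simple as dispatching on what each case provides. A sketch reusing the three functions above; the case schema is invented:

```python
def eval_case(case: dict, output: str, trace) -> dict:
    results = {}
    if "expected" in case:  # reference-based: only when a known-good answer exists
        results["reference"] = reference_eval(output, case["expected"])
    if "rubric" in case:    # rubric-based: only when criteria are written down
        results["rubric"] = rubric_eval(output, case["rubric"])
    results["behavior"] = behavioral_eval(trace)  # behavioral: always applicable
    return results
```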
Why running once isn't enough
If your agent is non-deterministic, one run tells you almost nothing. You need to run each eval case multiple times and aggregate. The aggregate metric matters more than any individual run.
```python
def eval_with_replication(case, n=5):
    # Run the same case n times and aggregate; one run is just one sample.
    results = [run_agent(case) for _ in range(n)]
    scores = [score(r, case) for r in results]
    return {
        "pass_rate": sum(1 for s in scores if s.passed) / n,
        "mean_score": sum(s.value for s in scores) / n,
    }
```

A "pass rate of 0.6" tells you the agent passes this case 60% of the time. That's much more useful than "this run passed" (which doesn't generalize) or "this run failed" (which might just be variance).
Variance is your enemy and your data
Variance across runs has two sources:
- Inherent stochasticity. The model's sampling produces different outputs.
- Real instability. The agent's behavior is fragile in ways that change the outcome based on tiny prompt differences.
Both manifest the same way in metrics. To distinguish:
- Set temperature to 0 and re-run. If variance disappears, it was sampling. If it persists, you have real instability (e.g., the agent depends on a non-deterministic tool).
- Check whether failures cluster in specific cases or are random. Clustered = real bugs in those cases; random = sampling.
For final reporting, run with the temperature you'll use in production. For diagnosis, set it to 0 to isolate the source of variance.
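One way to operationalize the clustering check. This assumes `run_agent` accepts a temperature argument, `score` returns an object with a `passed` flag (carried over from the sketch above), and each case has an `id` attribute, all assumptions:

```python
def diagnose_variance(cases, n=10, temperature=0.7):
    # Per-case pass rates at the production temperature (assumed kwarg).
    rates = {}
    for case in cases:
        passes = sum(
            score(run_agent(case, temperature=temperature), case).passed
            for _ in range(n)
        )
        rates[case.id] = passes / n
    # Failures concentrated in a few cases suggest real bugs in those cases;
    # mid-range rates spread across many cases suggest sampling variance.
    return {
        "broken": {cid: r for cid, r in rates.items() if r <= 0.2},
        "flaky": {cid: r for cid, r in rates.items() if 0.2 < r < 0.8},
    }
```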
Statistical confidence
If your pass rate goes from 0.60 to 0.65 between two builds, did you actually improve, or is that noise?
A binomial confidence interval helps. With 50 trials, a pass rate of 0.6 has a 95% confidence interval of roughly ±0.14. So 0.6 → 0.65 with 50 trials is not statistically meaningful; you'd need several hundred trials per case to reliably detect a change that small.
Don't treat eval scores as exact numbers. Treat them as ranges. A "pass rate of 0.62 ± 0.07" is more honest than "pass rate of 0.62."
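The arithmetic is simple enough to inline in your eval report. A self-contained sketch using the normal approximation (Wilson intervals behave better near 0 or 1, but this matches the numbers above):

```python
import math

def pass_rate_ci(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    # Normal-approximation 95% interval for a binomial proportion.
    p = passes / trials
    margin = z * math.sqrt(p * (1 - p) / trials)
    return p, margin

p, margin = pass_rate_ci(30, 50)
print(f"pass rate {p:.2f} ± {margin:.2f}")   # pass rate 0.60 ± 0.14
```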
Eval cases vs unit tests
Eval cases are different from unit tests in three ways:
| | Unit test | Eval case |
|---|---|---|
| Pass criterion | Exact match | Criterion / rubric / behavior |
| Number of runs | 1 (deterministic) | N (averaged) |
| Failure interpretation | Bug | Lower pass rate; might not be a regression |
| Right cadence | Every commit | Every nontrivial change |
You can't replace unit tests with evals; both have a place. Tests verify the deterministic parts (your code, your tools, your loop's plumbing). Evals verify the stochastic parts (the agent's outputs, the model's reasoning).
What evals catch that production logs don't
Production logs show what happened. Evals show what should have happened on a controlled set. Without evals you can't:
- Compare two prompts head-to-head (see the sketch after this list).
- Detect that a new model version regressed your agent.
- Notice that a change to one tool degraded an unrelated capability.
- Catch problems that don't trip your reliability layer but are still wrong (the answers are valid but worse).
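The head-to-head comparison in the first bullet reduces to comparing two binomial proportions. A sketch using the same normal approximation as above; the trial counts are made up:

```python
import math

def compare_variants(passes_a: int, passes_b: int, trials: int):
    # Difference in pass rates with a rough 95% margin on the difference.
    pa, pb = passes_a / trials, passes_b / trials
    se = math.sqrt(pa * (1 - pa) / trials + pb * (1 - pb) / trials)
    return pb - pa, 1.96 * se   # real improvement only if diff - margin > 0

diff, margin = compare_variants(120, 130, 200)   # 0.60 vs 0.65 at 200 trials
print(f"Δ = {diff:.2f} ± {margin:.2f}")          # 0.05 ± 0.09: still noise
```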
Production observability tells you when the wheels fall off; evals tell you when the steering wheel is off-center.
When you don't need evals
Some agent systems can skip formal evals:
- Tiny scope. A 2-tool agent that does one thing might be testable by hand with a checklist.
- Single-user demos. If the only judge is the user, ad-hoc testing is fine.
- Pre-product. Building evals before you have a working system is premature optimization.
Once you ship to real users, or once your agent gets nontrivial, evals become non-optional. Without them, every change is a gamble.
The eval set is the spec
A useful framing: the eval set is the de facto specification for what your agent should do. If a case is in the eval set, it's a thing you've committed to handling. If a case isn't, it's not. As you find production failures, add them to the eval set; as users request new behaviors, add them. Over time, the eval set becomes the most accurate description of "what this agent does." The code is the implementation; the evals are the definition.
Key takeaway
Agent evals replace traditional tests for non-deterministic outputs. Each case has criteria (reference-based, rubric-based, behavioral); each case is run multiple times; aggregate pass rates with confidence intervals are the meaningful metric. Evals are not optional once your agent has real users. The next lesson is about building the eval dataset itself: what to collect, how to label, what makes a good case.