Evaluation
Building eval datasets
Golden answers, rubrics, and judge models.
What goes in the eval set
An eval pipeline is only as good as the cases it runs against. A bad set tells you nothing useful (or worse: tells you something confidently wrong). This lesson covers how to build an eval set that actually predicts real-world quality, what fields each case needs, and how the set evolves.
The pattern: start with a small handcrafted core, grow it with mined production data, label new cases as they appear, version the whole thing.
What a case looks like
Each case is a small structured object. A reasonable shape:
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    id: str
    task_type: str    # "bug_fix", "research", "summarize", etc.
    input: dict       # what the agent is given
    expected: dict    # what we expect (rubric, reference, etc.)
    metadata: dict = field(default_factory=dict)   # tags, source, severity, owner
    # For reproducibility
    fixtures: list = field(default_factory=list)   # any tool/state setup needed before running

expected is where the eval logic lives. Three common shapes:
# Reference-based
{"type": "reference", "answer": "the bug is on line 42 of retries.py"}

# Rubric-based
{"type": "rubric", "criteria": [
    "Identifies the file and line",
    "Explains the root cause",
    "Does not invent code that isn't there",
]}

# Behavioral
{"type": "behavioral", "checks": [
    {"kind": "max_tool_calls", "limit": 8},
    {"kind": "must_call", "tool": "read_file"},
    {"kind": "must_not_call", "tool": "send_email"},
]}

A case can have multiple expected entries; the pass criterion is "all expectations met."
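A minimal sketch of that pass criterion, under two assumptions: the agent's result is a dict like {"output": str, "tool_calls": [tool names]}, and rubric grading is omitted because it typically calls a judge model (running judges is covered in the next lesson):

def check_reference(exp: dict, result: dict) -> bool:
    # Simplest possible reference check: substring match on the output.
    # Real pipelines often normalize or fuzzy-match instead.
    return exp["answer"].lower() in result["output"].lower()

def check_behavioral(exp: dict, result: dict) -> bool:
    calls = result["tool_calls"]
    for check in exp["checks"]:
        if check["kind"] == "max_tool_calls" and len(calls) > check["limit"]:
            return False
        if check["kind"] == "must_call" and check["tool"] not in calls:
            return False
        if check["kind"] == "must_not_call" and check["tool"] in calls:
            return False
    return True

def case_passes(case: EvalCase, result: dict) -> bool:
    # A rubric checker (judge-model call) would be registered here too.
    checkers = {"reference": check_reference, "behavioral": check_behavioral}
    expected = case.expected if isinstance(case.expected, list) else [case.expected]
    return all(checkers[exp["type"]](exp, result) for exp in expected)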
Where cases come from
Three sources, each contributing different value:
Handcrafted (the seed)
Write 20-50 cases by hand at the start. Cover the obvious capabilities, the most important tasks, and the few failure modes you've already seen. The seed is small but high-quality and very useful for early iteration.
These cases tend to be optimistic: they're things you intend the agent to handle. They miss real-world weirdness.
Mined from production
The bulk of a mature eval set comes from real user requests. Periodically sample production traces, label which ones the agent handled well or poorly, and add the interesting ones to the eval set.
# rough mining flow
for trace in sample_production_traces(n=1000, sampler="stratified_by_task_type"):
    if user_satisfaction_score(trace) < 0.5 or had_repeated_tools(trace):
        add_to_review_queue(trace)

for trace in review_queue:
    label = human_label(trace)
    if label.is_useful_for_evals:
        eval_set.add(EvalCase(
            id=f"prod-{trace.id}",
            task_type=trace.task_type,
            input=trace.user_input,
            expected=label.expected,
            metadata={"source": "production", "captured_at": trace.timestamp},
        ))

Production-mined cases reflect what real users actually ask. They're harder, weirder, and far more valuable than handcrafted ones for catching regressions.
Adversarial / synthetic
Cases generated to stress specific failure modes: prompt injection attempts, ambiguous requests, requests that need tools the agent doesn't have. Often generated by another LLM and lightly reviewed by a human.
These are useful for bounding "worst case" behavior. They tend to be unrepresentative of typical traffic, so weight their contribution to overall scores accordingly.
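One way to apply that weighting, sketched under the assumption that each case's metadata carries a source tag; the 0.25 down-weight for adversarial cases is purely illustrative:

SOURCE_WEIGHTS = {"handcrafted": 1.0, "production": 1.0, "adversarial": 0.25}

def weighted_pass_rate(results: list[tuple[EvalCase, bool]]) -> float:
    # Down-weight adversarial/synthetic cases so worst-case probes
    # don't dominate the headline score.
    total = passes = 0.0
    for case, passed in results:
        w = SOURCE_WEIGHTS.get(case.metadata.get("source"), 1.0)
        total += w
        passes += w * passed
    return passes / total if total else 0.0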
The 80/20 of high-value cases
Three properties make a case valuable:
Diagnostic
When a case fails, it tells you something specific. "The agent couldn't reason about cross-domain queries" is diagnostic; "the agent failed this test" is not.
Stable
The case behaves consistently. A case where 50% of runs pass and 50% fail is hard to use; you can't tell whether a build's regression is real or noise.
Representative
The case looks like real traffic, not just a contrived edge condition. A case mined from production almost always meets this; a hand-written one might not.
When pruning your eval set, drop cases that fail all three. Keep ones that score high on at least two.
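Of the three properties, stability is the easiest to measure directly. A sketch that estimates per-case flakiness by repeated runs; run_case here stands in for whatever your harness provides to execute the agent on a case:

from typing import Callable

def stability(case: EvalCase, run_case: Callable[[EvalCase], bool],
              runs: int = 5) -> float:
    # Pass rate over repeated runs, folded so that 1.0 means perfectly
    # stable (always passes or always fails) and 0.0 means maximally
    # flaky (passes exactly half the time).
    rate = sum(run_case(case) for _ in range(runs)) / runs
    return abs(rate - 0.5) * 2

# e.g. cases with stability below ~0.6 are fix-or-prune candidates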
Sampling strategy
If you have 10,000 candidate cases, you can't run them all on every change (cost, time). Sampling is the practical compromise.
A useful pattern:
- Tier 1: tiny core. 20-50 cases that run on every commit. Cheap, fast, blocks bad code. Primarily reference-based and behavioral, since rubric eval is expensive.
- Tier 2: regression set. 200-500 cases that run on every PR or nightly. Covers more ground. Mix of all three types.
- Tier 3: full set. 5000+ cases that run on releases or weekly. Comprehensive; catches subtle regressions. Mostly rubric and behavioral.
Tier 1 is the smoke test; Tier 3 is the deep eval. Each tier has different latency and cost expectations.
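One way to encode the tiers as plain configuration; the names, counts, and triggers below mirror the guidance above and are starting points, not a standard:

TIERS = {
    "smoke":      {"max_cases": 50,   "trigger": "every_commit",
                   "expected_types": ["reference", "behavioral"]},
    "regression": {"max_cases": 500,  "trigger": "pr_or_nightly",
                   "expected_types": ["reference", "rubric", "behavioral"]},
    "full":       {"max_cases": None, "trigger": "release_or_weekly",
                   "expected_types": ["reference", "rubric", "behavioral"]},
}

def cases_for_tier(eval_set: list[EvalCase], tier: str) -> list[EvalCase]:
    # Assumes each case is tagged with its tier in metadata.
    cfg = TIERS[tier]
    eligible = [c for c in eval_set if c.metadata.get("tier") == tier]
    return eligible if cfg["max_cases"] is None else eligible[:cfg["max_cases"]]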
Label quality
If your expected values are wrong, the eval is wrong. Two failure modes:
Inconsistent labels across cases
The same kind of answer is judged "passes" in one case and "fails" in another. This usually means the rubric isn't tight enough, or different humans are labeling differently. Calibrate by having multiple labelers grade a small set; rewrite the rubric until they agree.
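A quick calibration check is pairwise percent agreement across labelers (Cohen's kappa is the more rigorous option). This sketch assumes each labeler's grades are stored as a case_id → label dict:

from itertools import combinations

def percent_agreement(labels_by_labeler: dict[str, dict[str, str]]) -> float:
    # Fraction of (shared case, labeler pair) comparisons that agree.
    agree = total = 0
    for a, b in combinations(labels_by_labeler.values(), 2):
        for case_id in a.keys() & b.keys():
            total += 1
            agree += a[case_id] == b[case_id]
    return agree / total if total else 1.0

# e.g. agreement below ~0.8 on the calibration set usually means the
# rubric needs tightening before the labels can be trusted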
Outdated labels
The agent's correct behavior changed (you launched a new feature, deprecated a tool, updated policy) but the eval cases still expect the old behavior. Cases get marked "failing" when the agent is actually right. Track which dependencies (tools, prompts) each case relies on, so that when a dependency changes you know which cases need re-labeling.
A small audit cadence (e.g., monthly review of a sample of cases for label freshness) keeps the set honest.
Versioning the eval set
Treat the eval set like code:
- Source of truth. A git repo or database, not a spreadsheet.
- Versioned. Tag releases of the set. When comparing two builds, compare on the same eval-set version.
- Reviewed. Adding or modifying cases goes through PR review.
- Audited. Track who added each case, when, and why.
Without versioning, you can't tell whether a "regression" is from a code change or an eval-set change. Both can move scores; you have to disambiguate.
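A cheap way to make "same eval-set version" checkable is to record a content fingerprint with every eval run; a sketch, assuming cases are the EvalCase dataclass above:

import hashlib
import json
from dataclasses import asdict

def eval_set_fingerprint(cases: list[EvalCase]) -> str:
    # Stable hash of the set's content. Store it alongside every eval
    # run so two builds are provably compared on the same set version.
    canonical = json.dumps(
        sorted((asdict(c) for c in cases), key=lambda d: d["id"]),
        sort_keys=True, default=str,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]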
What not to eval
Some things look like good eval cases but aren't:
- Real-time data. Cases that depend on "current weather" or "today's news" go stale and start failing for unrelated reasons. If you must include them, mock the relevant data sources (see the sketch after this list).
- External services that flap. Cases that depend on a flaky external API will give you noisy eval scores. Mock the calls or skip the case.
- Subjective taste. "This summary is well-written" is hard to evaluate consistently. Either get specific (under 200 words, mentions all three points) or move it out of the formal eval set.
- PII or sensitive data. Eval sets get logged, shared, and shipped. Don't include user-identifiable data without redaction.
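For the real-time-data point, here is what mocking can look like using the fixtures field from the EvalCase shape above; the fixture format is an assumption, not a fixed schema:

# Pin "current" data so the case can't go stale: the harness serves
# the canned response instead of calling the real weather tool.
weather_case = EvalCase(
    id="weather-paris-001",
    task_type="research",
    input={"query": "What's the weather in Paris right now?"},
    expected={"type": "reference", "answer": "14°C and raining"},
    metadata={"source": "handcrafted", "mocked": True},
    fixtures=[{
        "tool": "get_weather",
        "returns": {"city": "Paris", "temp_c": 14, "conditions": "rain"},
    }],
)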
How big the set should be
Bigger is better, up to a point. Practical guidance:
- Fewer than 50 cases. You'll have high-variance scores. Useful for early iteration, but don't trust small differences.
- 100-500 cases. Good for a typical product-stage agent. Detects meaningful regressions at a tolerable cost per run.
- 1000+ cases. Comprehensive; detects subtle changes. Run nightly, not per-commit.
Don't try to start at 1000; you'll spend months building the set instead of improving the agent. Start at 30-50, ship, mine production traces, grow the set in the background.
Maintenance is a feature
Eval sets rot. Tools change, prompts change, expected behavior changes. Allocate explicit time to maintain the set:
- Add cases for every reported bug. A bug that escapes is a missing eval case.
- Update expected values when behavior changes intentionally. New feature → new expected behavior in relevant cases.
- Retire cases that no longer pull weight. A case that's been passing 100% of the time for six months can be demoted from tier 1 to tier 2; a sketch for flagging these follows below.
If your eval set hasn't changed in months, either your agent is in maintenance mode (fine) or your eval set has rotted into uselessness (less fine).
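The "retire cases" point is easy to automate. A sketch that flags demotion candidates from run history; the history format is an assumption, substitute whatever your harness logs:

from datetime import datetime, timedelta

def demotion_candidates(history: dict[str, list[tuple[datetime, bool]]],
                        window_days: int = 180) -> list[str]:
    # history maps case_id -> [(run_time, passed), ...].
    # Returns ids that passed 100% of runs over the whole window;
    # these are candidates to move from tier 1 down to tier 2.
    cutoff = datetime.now() - timedelta(days=window_days)
    stale = []
    for case_id, runs in history.items():
        recent = [passed for t, passed in runs if t >= cutoff]
        if recent and all(recent):
            stale.append(case_id)
    return stale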
Mine traces aggressively in the first weeks
The single best thing you can do for an agent's eval set in its first month of production is sample a representative slice of traces every day and label them. This is unglamorous work that pays off forever: the more production-grounded your eval set is, the better your future iteration cycles will be. Don't skip it because it's tedious; the alternative is iterating on hand-written cases that don't represent real users.
Key takeaway
Eval cases need: a clear input, an expected definition (reference, rubric, or behavioral), and metadata for sourcing. Build a small handcrafted seed, then aggressively mine production traces. Tier the set by cost (smoke test, regression, full eval). Version it like code; audit labels. Don't put real-time data, flaky externals, or subjective taste in the formal set. The next lesson covers actually running the eval pipeline once you have the set.