Evaluation
Automated eval pipelines
LLM-as-judge and programmatic checks.
Running the eval set
You have a set of cases (lesson 2). You have ways to score them (lesson 1: reference, rubric, behavioral). This lesson is about the pipeline that takes a build of your agent and produces a usable signal: pass rates, regressions, dashboards, alerts. The mechanics matter because evals only help if they actually run, fast enough, frequently enough, on the right things.
What the pipeline does
A typical pipeline:
- Pick the cases. Filter by tier (smoke, regression, full).
- Run the agent. For each case, execute it N times against the build under test.
- Score each run. Apply each `expected` entry to produce a per-criterion score.
- Aggregate. Compute pass rates, mean scores, confidence intervals.
- Compare. Diff against a baseline (last release, last main, last week).
- Report. Surface the diff to humans (dashboard, PR comment, alert).
The complexity is in steps 5 and 6; steps 1-3 are mostly mechanical.
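Stitched together, the steps look roughly like this; select_cases and report are hypothetical stand-ins, while run_eval_set and diff_runs are sketched later in this lesson:

```python
async def eval_pipeline(agent, tier, baseline):
    cases = select_cases(tier)                  # 1. pick the cases
    results = await run_eval_set(cases, agent)  # 2-4. run, score, aggregate
    diff = diff_runs(results, baseline)         # 5. compare to baseline
    report(diff)                                # 6. surface to humans
    return diff
```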
Running cases efficiently
Evals are embarrassingly parallel: each case is independent. Run them concurrently.
```python
import asyncio

async def run_eval_set(cases, agent, replicas=3):
    # Every (case, replica) pair is independent, so fan them all out
    tasks = []
    for case in cases:
        for _ in range(replicas):
            tasks.append(run_one(case, agent))
    runs = await asyncio.gather(*tasks)
    return aggregate(runs)
```

For a tier-2 set of 300 cases run 3 times each, that's 900 agent invocations. Concurrency matters: if you run them serially at 5 seconds each, that's an hour and 15 minutes. With concurrency limited to 50, it's 90 seconds.
Cap concurrency to avoid hammering downstream services. Most production-eval pipelines run with 20-50 concurrent agents.
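A minimal way to enforce the cap, assuming the same run_one and aggregate helpers as above, is an asyncio.Semaphore:

```python
import asyncio

async def run_eval_set_capped(cases, agent, replicas=3, max_concurrency=50):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(case):
        # At most max_concurrency agent invocations in flight at once
        async with sem:
            return await run_one(case, agent)

    tasks = [bounded(case) for case in cases for _ in range(replicas)]
    runs = await asyncio.gather(*tasks)
    return aggregate(runs)
```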
Judges and graders
For rubric-based scoring, you need a "judge" model that reads the agent's output and the rubric, then scores each criterion. The judge is itself an LLM call.
```python
async def judge(output, criterion):
    prompt = f"""Grade whether the following agent output satisfies the criterion.
Criterion: {criterion}
Agent output:
{output}
Respond with PASS or FAIL and a one-line reason."""
    # model() is the judge LLM call; temperature=0 so the same
    # output is graded the same way every run
    response = await model(prompt, temperature=0)
    return parse_judge_response(response)

def parse_judge_response(response):
    # Minimal parser: first word is the verdict, the rest is the reason
    verdict, _, reason = response.strip().partition(" ")
    return verdict.upper().startswith("PASS"), reason.strip()
```

Three things to know about judges:
Use a different model than the one being evaluated
If you grade GPT-4 with GPT-4, you bias toward GPT-4-shaped answers. A different family (Claude judging GPT, or vice versa) usually produces more balanced grades.
Run with temperature 0
Determinism matters for the judge specifically; you want the same output graded the same way every time.
Calibrate against human labels
Spot-check the judge against human grading on a sample. If they disagree more than 10-15% of the time, your rubric is too vague or your judge prompt needs work.
Aggregation
For each (case, criterion) pair you have N scores. Aggregate:
- Pass rate: fraction of replicas that passed.
- Mean score (for graded criteria): average across replicas.
- 95% confidence interval for both (a pass-rate sketch follows this list).
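The lesson doesn't prescribe an interval method, so this sketch assumes the Wilson score interval, which behaves reasonably at small replica counts:

```python
import math

def pass_rate_with_ci(passes, n, z=1.96):
    # Wilson score interval at 95% (z = 1.96); avoids the degenerate
    # zero-width intervals the normal approximation gives at p = 0 or 1
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, (max(0.0, center - half), min(1.0, center + half))
```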
Then aggregate across cases for an overall metric. A simple "overall pass rate" is the average of per-case pass rates, which assumes equal weight per case. For weighted suites (some cases matter more), use weights:
```python
def weighted_score(case_scores, weights):
    # Scores and weights are paired positionally
    return sum(w * s for s, w in zip(case_scores, weights)) / sum(weights)
```

Most teams start unweighted; introduce weights when you have a reason.
Comparing builds
The overall score is interesting; the delta against a baseline is what tells you whether to ship.
```python
def diff_runs(current, baseline):
    return {
        # Headline delta
        "overall": current.overall - baseline.overall,
        # Per-task-type deltas; task types new in this build diff against 0
        "by_task_type": {
            t: current.by_task[t] - baseline.by_task.get(t, 0)
            for t in current.by_task
        },
        # Cases whose pass rate dropped by more than 10 points
        "regressions": [
            c for c in current.cases
            if current.cases[c].pass_rate < baseline.cases.get(c, baseline.default).pass_rate - 0.1
        ],
    }
```

The regressions list is the most important output. "Overall is up 0.02 but cases X, Y, Z regressed by 0.3" is much more useful than "overall is up 0.02."
Where to run the pipeline
Three integrations:
Per-PR
Run tier 1 (smoke) on every PR. Maybe tier 2 if the PR touches the agent core. Block merge if regressions exceed a threshold.
Nightly
Run tier 2 on the latest main. Post results to a dashboard. Flag regressions for someone to look at.
Weekly / per-release
Run tier 3 (full set). Comprehensive picture; catches things tier 2 missed. This is the eval that gates a release.
Each tier corresponds to a cadence and an audience: tier 1 failures block engineers' merges; tier 3 results are reviewed by a quality team or product owner.
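As a sketch, the per-PR gate can consume the diff_runs output directly; the function name and thresholds here are illustrative assumptions:

```python
def should_block_merge(diff, max_regressions=0, overall_tolerance=-0.02):
    # Block if any case regressed (the `regressions` list from diff_runs)
    # or the overall score dropped past the tolerance
    if len(diff["regressions"]) > max_regressions:
        return True
    return diff["overall"] < overall_tolerance
```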
Cost and time budgets
Eval pipelines are expensive. Track:
- Cost per run. One full pass of tier 3 might cost $50 or more in LLM tokens.
- Time per run. Tier 1 should be under 5 minutes; tier 3 can take hours.
- Cost per detected regression. This is the metric that justifies the spend.
If your tier 3 costs $100 per run, runs weekly, and catches one regression a month, that's roughly $400 per detected regression. If your engineers would have shipped that regression and lost a customer, that's a great trade. If they would have caught it manually anyway, it's wasteful.
Budget evals like you'd budget any production system.
Eval pipeline observability
Same observability discipline as the agent itself:
- Per-case latencies and costs. Some cases may be 10x more expensive than others; investigate.
- Per-judge agreement rates. When two judges disagree, log it.
- Flaky case detection. Cases whose pass rate has high variance across runs need attention: either fix the case or fix the agent (a detection sketch follows this list).
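A minimal detector; the history shape and the 0.25 standard-deviation threshold are illustrative assumptions:

```python
from statistics import pstdev

def flaky_cases(pass_rate_history, threshold=0.25):
    # pass_rate_history maps case id -> pass rates from recent pipeline runs.
    # High variance means the case (or the agent) is unstable.
    return [case for case, rates in pass_rate_history.items()
            if len(rates) >= 3 and pstdev(rates) > threshold]
```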
Treat the eval pipeline as an internal product. Maintain it. Iterate on it. The team that hates running evals soon stops running them.
Catastrophe modes
A few specific failures worth noticing:
A judge that's always agreeing
If your judge passes 95% of cases, either your agent is amazing (unlikely) or your judge is rubber-stamping. Stress-test by feeding it bad outputs and seeing if it correctly fails them.
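A minimal stress test, building on the judge() sketch above; known_bad_outputs is a hypothetical list of outputs a human already labeled as failures:

```python
async def stress_test_judge(criterion, known_bad_outputs):
    wrongly_passed = 0
    for output in known_bad_outputs:
        passed, _reason = await judge(output, criterion)
        wrongly_passed += passed  # bool counts as 0/1
    # Rubber-stamp rate: anything well above 0 deserves scrutiny
    return wrongly_passed / len(known_bad_outputs)
```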
A judge that's always disagreeing
The opposite. If pass rates collapse after a judge upgrade, suspect the judge before suspecting the agent.
A new model release that craters scores
When the underlying model updates, eval scores can move. Sometimes this is real regression; sometimes the judge's calibration shifted with the model. Don't blindly trust score changes after model upgrades; recalibrate.
Eval cases that are wrong
A case whose `expected` entry is mislabeled passes when the agent does the wrong thing and fails when it does the right thing. Tracking ownership and freshness metadata on each case helps catch these.
What this enables
With a working eval pipeline:
- Every prompt change can be tested before deploy.
- Every model upgrade can be evaluated head-to-head against the current model.
- Every reported bug becomes a regression test (after labeling).
- Iteration speed goes up because you have an objective signal.
Without one, every change is "I think this is better."
Don't tune to the eval set
A risk worth flagging: if your team only optimizes against the eval set, you'll overfit to it. The agent gets great at the cases you've measured and worse at the ones you haven't. Mitigate by: (a) continually mining production data so the eval set keeps growing with reality; (b) periodically running blind evaluations, grading the agent on cases it has never been tested against; (c) holding out 10-20% of cases as a "validation" set that is checked rarely (a split sketch follows).
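A minimal holdout split; the fraction and the fixed seed are illustrative assumptions:

```python
import random

def split_holdout(cases, holdout_frac=0.15, seed=7):
    # Set aside a rarely-checked validation slice (10-20% per the text).
    # A fixed seed keeps the split stable across pipeline runs (assumption).
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * holdout_frac)
    return shuffled[k:], shuffled[:k]  # (working set, holdout)
```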
Key takeaway
The eval pipeline is what turns the eval set into a usable signal: it picks cases, runs them with replicas, scores them with a judge model, aggregates, and diffs against a baseline. Use a different model as the judge; calibrate against humans; report regressions specifically, not just overall scores. Run tier 1 per PR, tier 2 nightly, tier 3 per release. Treat the pipeline as a real product. The next module switches to the production-system observability that complements evals.