An eval you run manually is just better vibes

Last post was about building an eval harness. This one is about a more frustrating failure: the team that built the harness and got burned anyway.

Here's how it goes. The team has a good eval suite. It scores the agent across scenarios, and everyone's proud of it. But running it is a manual step, and the result is a report someone reads. One week, under deadline, an engineer makes a prompt change that improves the cases they're focused on and quietly regresses a category they aren't thinking about. They skip the suite because it takes twenty minutes and the change is "obviously fine". Or they run it and skim past the one red number buried in a wall of green. The change ships. The regression is live. The suite that would have caught it sat right there, unrun or unread.

This is worse than having no evals, because the team did the hard part and got no protection at the moment it mattered. The lesson: an eval suite's value is not in existing but in being unbypassable at the moment of deploy. Regressions don't restrict themselves to convenient times.

Make it a gate

The model to steal is CI. Nobody relies on engineers remembering to run the tests before merging: the pipeline runs them and blocks the merge on red. Apply the same pattern here: the eval suite runs automatically on every change to the agent, and the deploy is blocked if the evals regress. The engineer doesn't choose to run it and can't choose to skip it.

Three decisions make this work for a non-deterministic system, and none of them come up in normal testing.

Compare against a baseline, not a standard. The right question isn't "did it pass" but "did it get worse than what's running now". Keep the eval profile of the currently deployed agent as the baseline and compare each candidate to it. When a candidate passes and deploys, it becomes the new baseline, so the bar never quietly drops. One trap here: if the new baseline is just the candidate's gate-run numbers, a lucky run gets enshrined as the bar, and future candidates that are actually fine get blocked for missing a number that was never real. Use a rolling baseline over the last several versions, or re-measure the new baseline fresh over enough runs to wash out the noise.
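The rolling-baseline option can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the class name `RollingBaseline`, the window size, and the idea of tracking one pass rate per deployed version are all assumptions made for the example.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Baseline pass rate averaged over the last few deployed versions."""

    def __init__(self, window: int = 5):
        # Hypothetical window size; tune it to how often you deploy.
        self._rates = deque(maxlen=window)

    def record_deploy(self, pass_rate: float) -> None:
        # Call this once a candidate has passed the gate and shipped.
        self._rates.append(pass_rate)

    def value(self) -> float:
        if not self._rates:
            raise ValueError("no deployed versions recorded yet")
        return mean(self._rates)


baseline = RollingBaseline(window=3)
for rate in (0.90, 0.92, 0.97):  # 0.97 stands in for the lucky run
    baseline.record_deploy(rate)

# The bar is the average of the window, not the single lucky 0.97.
print(round(baseline.value(), 3))
```

Averaging means one lucky gate run moves the bar only a little, and it ages out of the window after a few more deploys.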

Set thresholds from measured noise. A candidate's rate will differ from the baseline just from run-to-run jitter, so a gate that blocks on any decrease blocks constantly on nothing. Run the identical agent several times, measure how much the numbers move on their own, and block only when a candidate falls outside that band. This is why the last post insisted on rates over single runs: you need the variance to know what's jitter and what's real.
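Here is one way the noise band might be computed. It's a sketch under assumptions: the band is taken as `k` sample standard deviations of the repeated runs, and `k = 2.0` is an illustrative choice, not a recommendation. The function names are made up for the example.

```python
from statistics import stdev


def noise_band(repeat_rates: list[float], k: float = 2.0) -> float:
    """Half-width of the jitter band, from repeated runs of the SAME agent."""
    if len(repeat_rates) < 2:
        raise ValueError("need at least two repeated runs to estimate noise")
    # Assumption: k standard deviations is "outside the band".
    return k * stdev(repeat_rates)


def is_regression(candidate_rate: float, baseline_rate: float, band: float) -> bool:
    # Only a drop larger than the measured jitter counts as a regression.
    return (baseline_rate - candidate_rate) > band


# Five runs of the identical agent, no changes between them.
repeats = [0.90, 0.92, 0.88, 0.91, 0.89]
band = noise_band(repeats)

print(is_regression(0.88, 0.90, band))  # a 0.02 drop, inside the band
print(is_regression(0.85, 0.90, band))  # a 0.05 drop, outside the band
```

With these numbers the band is a little over three points, so a two-point dip passes and a five-point dip blocks. More repeats give a steadier estimate of the band.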

Gate dimensions differently. A safety regression, such as the agent newly proposing a forbidden action, blocks hard with a tight threshold. A small efficiency regression warns instead of blocking, because it's a cost issue, not a danger. Correctness sits in between. This is the payoff of scoring correctness, safety and efficiency separately: they can gate separately.
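Per-dimension gating could look something like the sketch below. Everything specific in it is hypothetical: the threshold values, the `Policy` and `Verdict` names, and the assumption that every dimension is a higher-is-better score between 0 and 1. In practice the thresholds would come from the measured noise for each dimension.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    BLOCK = "block"


@dataclass(frozen=True)
class Policy:
    max_drop: float     # allowed drop below baseline before the policy fires
    on_breach: Verdict  # what happens when the drop exceeds max_drop


# Hypothetical thresholds, for illustration only.
POLICIES = {
    "safety": Policy(max_drop=0.01, on_breach=Verdict.BLOCK),
    "correctness": Policy(max_drop=0.03, on_breach=Verdict.BLOCK),
    "efficiency": Policy(max_drop=0.05, on_breach=Verdict.WARN),
}


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, Verdict]:
    """Return one verdict per dimension, comparing candidate to baseline."""
    verdicts = {}
    for dim, policy in POLICIES.items():
        drop = baseline[dim] - candidate[dim]
        verdicts[dim] = policy.on_breach if drop > policy.max_drop else Verdict.PASS
    return verdicts


def deploy_allowed(verdicts: dict[str, Verdict]) -> bool:
    # Warnings are reported but do not stop the deploy; any BLOCK does.
    return Verdict.BLOCK not in verdicts.values()


baseline = {"safety": 1.00, "correctness": 0.90, "efficiency": 0.80}
slower = {"safety": 1.00, "correctness": 0.89, "efficiency": 0.70}
unsafe = {"safety": 0.98, "correctness": 0.93, "efficiency": 0.85}

print(deploy_allowed(gate(baseline, slower)))  # efficiency warns, deploy proceeds
print(deploy_allowed(gate(baseline, unsafe)))  # safety blocks despite other gains
```

The second candidate improves correctness and efficiency and still fails, which is the point of gating the dimensions separately: a gain in one can't buy back a loss in safety.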

With those three set, the gate is mechanical. The buried red number now stops the deploy instead of being skimmed past, because the pipeline reads the number, not a tired human under deadline.

The gate isn't enough

Here's the catch: the agent can degrade without any deploy at all. The model provider ships a new version under the same API. The input distribution shifts. The agent's memory accumulates and drifts. The deploy gate never sees any of this because nothing was deployed.

The answer is production sampling: score a small slice of real production runs continuously and watch the trend. The awkward part is that production runs don't come with ground truth: if you knew the correct diagnosis, you wouldn't need the agent. There are three workarounds. Use a reference-free judge, scoring against a rubric ("is this diagnosis supported by the evidence gathered") instead of a golden answer. Use outcome signals that arrive later: did the fix actually work, did a human approve or reject it? And route a small sample to human review, which is expensive per run but cheap at a low rate.
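One simple way to pick the slice is to hash the run ID, so the same run always gets the same decision and no state needs to be stored. This is a sketch of that one technique, not the only way to sample. The rates, the salts, and the reviewer labels `rubric_judge` and `human_review` are placeholders invented for the example.

```python
import hashlib

# Hypothetical rates; the right values depend on volume and review budget.
JUDGE_RATE = 0.05
HUMAN_RATE = 0.005


def in_sample(run_id: str, rate: float, salt: str = "") -> bool:
    """Deterministically select roughly `rate` of runs by hashing the run ID."""
    digest = hashlib.sha256(f"{salt}:{run_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def reviewers_for(run_id: str) -> list[str]:
    """Decide which scoring paths a production run is routed to."""
    reviewers = []
    # Different salts make the two samples independent of each other.
    if in_sample(run_id, JUDGE_RATE, salt="judge"):
        reviewers.append("rubric_judge")
    if in_sample(run_id, HUMAN_RATE, salt="human"):
        reviewers.append("human_review")
    return reviewers


sampled = sum(in_sample(f"run-{i}", JUDGE_RATE, salt="judge") for i in range(10_000))
print(sampled)  # roughly 5% of 10,000 runs
```

Outcome signals that arrive later don't need sampling at all. They can be joined to the run by ID whenever they show up.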

Watch the trend, not the level. One low-scoring run is noise. A week of scores bending down is decay, and the bend is visible well before users complain. And the low scorers do double duty: every bad production run is a candidate test case for the suite, which is how the eval set keeps learning.
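Trend-watching can be as simple as fitting a line through daily average scores and alerting on the slope. The sketch below uses an ordinary least-squares slope. The `max_daily_drop` threshold and the seven-day window are assumptions for illustration, and a real threshold should come from how much the daily average moves on a healthy week.

```python
from statistics import mean


def trend_slope(daily_scores: list[float]) -> float:
    """Least-squares slope of score against day index (change per day)."""
    n = len(daily_scores)
    if n < 2:
        raise ValueError("need at least two days of scores")
    x_bar = (n - 1) / 2
    y_bar = mean(daily_scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(daily_scores))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def is_decaying(daily_scores: list[float], max_daily_drop: float = 0.005) -> bool:
    # Hypothetical threshold: alert when the fitted line falls faster than this.
    return trend_slope(daily_scores) < -max_daily_drop


one_bad_day = [0.90, 0.90, 0.90, 0.60, 0.90, 0.90, 0.90]
bending_down = [0.90, 0.89, 0.88, 0.87, 0.85, 0.84, 0.82]

print(is_decaying(one_bad_day))   # a single dip does not tilt the line
print(is_decaying(bending_down))  # a steady slide does
```

The single bad day is a much larger drop than anything in the sliding week, yet only the slide trips the alert. A bad day at the very end of the window would tilt the line more, which is one reason to look at several windows before paging anyone.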

Between them, the two mechanisms cover everything. Degradation either comes from a change, which the gate catches, or from the world, which sampling catches. There's no third source.

Let people override it, loudly

A gate that can never be overridden becomes a gate people disable, and a disabled gate protects nothing. There are legitimate reasons to ship past a red gate: the eval itself is wrong, the regression is in a scenario that no longer matters, or it's an emergency fix. So allow the override, but never silently. Shipping past red requires a recorded decision with a name and a reason attached.

This strengthens the gate rather than weakening it. The opening regression shipped because skipping was passive, the path of least resistance. Make the bypass explicit and the same engineer doesn't ship it, because the moment you have to write down "shipping a known safety regression because I'm in a hurry", you don't. And watch what gets overridden: frequent overrides on one scenario mean the eval is broken, not the agent.
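A loud override doesn't need much machinery. The sketch below refuses an override without a name or a reason, appends the decision to a log, and counts overrides per scenario. The field names, the in-memory list standing in for a real audit log, and the example scenario name are all assumptions.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Override:
    engineer: str
    reason: str
    scenario: str
    recorded_at: str


def record_override(log: list[str], engineer: str, reason: str, scenario: str) -> Override:
    """Append an override to the log, refusing anonymous or unexplained ones."""
    if not engineer.strip():
        raise ValueError("override needs a name")
    if not reason.strip():
        raise ValueError("override needs a reason")
    entry = Override(
        engineer=engineer.strip(),
        reason=reason.strip(),
        scenario=scenario,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
    log.append(json.dumps(asdict(entry)))  # one JSON line per decision
    return entry


def overrides_by_scenario(log: list[str]) -> Counter:
    """Count overrides per scenario; a hot spot suggests a broken eval."""
    return Counter(json.loads(line)["scenario"] for line in log)


audit_log: list[str] = []
record_override(audit_log, "dana", "rubric predates the new runbook", "disk-full")
record_override(audit_log, "sam", "same stale rubric, ticket filed", "disk-full")
print(overrides_by_scenario(audit_log).most_common(1))
```

Two overrides on the same scenario with the same complaint is the signal to go fix the eval rather than keep overriding it.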

Start with one gate

You don't need the full apparatus. Pick the one dimension where a regression hurts most, usually safety. Baseline it, measure the noise band, wire the check into the path your deploys actually take, and add the logged override. One real gate protects more than a perfect harness nobody's required to run.

This post is drawn from chapter eight of my book on production AI agents. Next post: the thing all of this measurement quietly depends on, which is traces, and why observability, not intelligence, solved the incident that opened this series.