Plausible is not correct
While testing the SRE agent I build in my book, I ran it through a simulated chaos day: five incidents in eight hours. It handled the easy ones well. It got the hard cascade wrong every single run, while sounding entirely confident. And the only reason I knew is that I was the source of truth: the scenario file literally said what the right diagnosis was.
In production, nobody hands you the truth. The incident is real, the agent's diagnosis reads beautifully, and you have no oracle telling you whether plausible is correct. Building that oracle is what an eval harness is for, and the lack of one is the single most-cited reason agent pilots die: evaluation gaps, named by 64% of teams in the surveys behind the 88% failure rate I keep referencing in this series.
The six weeks of vibes
Here's how it usually goes. A team builds an agent, tries a few dozen inputs by hand, the outputs look good, they ship. That's evaluation by vibes: a small sample, chosen by the person who wants the agent to work, with no record and no repeatability. It feels like testing because you looked at outputs. It isn't, because you can't run it again and can't detect when it changes.
The agent runs in production and seems fine. Then six weeks in, the model provider ships a quiet update, or someone tweaks a prompt, and accuracy on some slice of inputs drops thirty percent. Nobody notices, because vibes don't run continuously; they run when a human happens to look. The degradation sits there invisibly until it's big enough to generate complaints.
The thirty percent drop is not the failure; the failure is that it was invisible. Regular software has tests: a change that broke thirty percent of behavior would turn a suite red before shipping. The eval harness is what produces that red for an agent.
Why you can't just write unit tests
Two difficulties, and pretending they don't exist is how teams end up with bad evals.
First, an agent breaks both assumptions a unit test relies on. It's non-deterministic, so the same input produces different outputs across runs and exact-match assertions will flap. And its outputs are open-ended: there are a dozen good ways to phrase the right diagnosis and the agent will use a thirteenth. You can't assert equality against a golden string.
Second, an agent is a trajectory, not an input-output pair. The final answer can be right while the path was wasteful, or the answer can be wrong because of one bad step in an otherwise sound investigation. So you evaluate at two levels. Trajectory-level: did the whole run reach a good outcome by an acceptable path within acceptable cost? Step-level: freeze the agent's exact context at one decision point, replay just that decision, and score it. Trajectory evals tell you the agent failed the incident. Step evals tell you where: for example, the conclusion step stops at the service that alerted instead of tracing one hop upstream. The first is true but not actionable. The second you can fix.
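A minimal sketch of a step-level eval, under stated assumptions: the `decide` callable stands in for however your agent picks its next action from a frozen context, and the service names and action strings are hypothetical, not taken from the book's agent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StepCase:
    """One frozen decision point captured from a recorded run."""
    name: str
    context: tuple[str, ...]            # everything the agent had seen so far
    acceptable_actions: frozenset[str]  # next actions a reviewer would accept


# Hypothetical signature: takes the frozen context, returns the next action.
DecideFn = Callable[[Sequence[str]], str]


def score_step(decide: DecideFn, case: StepCase) -> bool:
    """Replay a single decision against the frozen context and score it."""
    return decide(case.context) in case.acceptable_actions


# Illustrative capture: checkout alerted, but its logs point one hop upstream.
conclusion_step = StepCase(
    name="cascade-conclusion",
    context=(
        "alert: checkout-service 5xx rate high",
        "tool: checkout-service logs show timeouts calling payments-db",
    ),
    acceptable_actions=frozenset({"inspect:payments-db"}),
)


def stops_at_alerting_service(context: Sequence[str]) -> str:
    # Stub agent that blames the service that alerted.
    return "conclude:checkout-service"


def traces_upstream(context: Sequence[str]) -> str:
    # Stub agent that follows the dependency one hop upstream.
    return "inspect:payments-db"


print(score_step(stops_at_alerting_service, conclusion_step))  # False
print(score_step(traces_upstream, conclusion_step))            # True
```

Because the context is frozen, a failure here points at one decision rather than at the whole incident.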
And because the agent is non-deterministic, run each scenario several times and report rates, not single results. A scenario the agent gets right four times out of five is a different beast from one it gets right five out of five, and one run can't tell them apart.
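Reporting a rate can be as small as the sketch below. The `run_once` callable is an assumption: it stands in for "run the agent on this scenario and score the result as pass or fail". The `scripted` helper exists only to make the example deterministic.

```python
from typing import Callable, Iterable


def pass_rate(run_once: Callable[[], bool], runs: int = 5) -> float:
    """Run one scenario `runs` times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if run_once())
    return passes / runs


def scripted(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Test helper: replay a fixed sequence of pass/fail outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


flaky = pass_rate(scripted([True, True, False, True, True]), runs=5)
solid = pass_rate(scripted([True] * 5), runs=5)
print(flaky, solid)  # 0.8 1.0
```

A single run of either scenario could have come back green; only the rate separates them.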
Score on three dimensions, separately
A trajectory score isn't one number; it's a profile:
Correctness: did it reach the right diagnosis? This is the headline number.
Safety: did it ever propose a forbidden action? Track this separately because a wrong diagnosis is a miss but a forbidden action is a danger, and a dangerous-but-usually-right agent shouldn't get to hide its danger behind its accuracy.
Efficiency: steps, time, tokens. After a model update an agent can stay 93% correct but go from eight steps per incident to fourteen. Something degraded, and a single accuracy number would never show it.
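One way to keep the three dimensions apart is to aggregate them into separate fields rather than a blended score. This is a sketch, not a prescribed schema: the `RunRecord` and `Profile` field names are my own, and the token counts are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    correct: bool                    # reached the right diagnosis
    forbidden_action_proposed: bool  # safety violation in this run
    steps: int
    tokens: int


@dataclass(frozen=True)
class Profile:
    correctness_rate: float
    unsafe_run_rate: float
    mean_steps: float
    mean_tokens: float


def profile(runs: list[RunRecord]) -> Profile:
    """Aggregate runs into three separately reported dimensions."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return Profile(
        correctness_rate=sum(r.correct for r in runs) / n,
        unsafe_run_rate=sum(r.forbidden_action_proposed for r in runs) / n,
        mean_steps=mean(r.steps for r in runs),
        mean_tokens=mean(r.tokens for r in runs),
    )


# Illustrative data: same correctness before and after, very different cost.
before = profile([RunRecord(True, False, 8, 1000)] * 4)
after = profile([RunRecord(True, False, 14, 2000)] * 4)
print(before.correctness_rate == after.correctness_rate)  # True
print(before.mean_steps, after.mean_steps)                # 8 14
```

Comparing the two profiles field by field shows the step count moving while correctness stays flat, which a single accuracy number would hide.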
The judge can lie
For open-ended outputs like diagnoses, the standard answer is LLM-as-judge: ask a model whether the agent's answer is equivalent to the reference. It works, it scales, and it's also a measurement instrument most teams never calibrate. Judges favor longer answers and answers phrased like their own. They score the same output differently across runs. They confidently declare a subtly wrong diagnosis equivalent to the right one because the two share surface features.
Two rules keep this from poisoning your numbers. First, validate the judge: have humans label a sample, measure agreement, fix the judge prompt until agreement is solid, and re-check periodically. An unvalidated judge produces numbers that look like data but are not. Second, don't judge what you can assert. Forbidden action proposed? Set membership. Tool call well-formed? Schema check. Rules are deterministic and free, and they don't lie. Save the judge for the few dimensions that have no cheaper check.
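Both rules fit in a few lines. In the sketch below, the forbidden action names and the tool-call shape are hypothetical, and Cohen's kappa is my choice of agreement statistic (it corrects raw agreement for chance); the text above only says to measure agreement, so treat the metric as an assumption.

```python
from collections import Counter
from typing import Iterable, Sequence

# Hypothetical action names; substitute your own deny-list.
FORBIDDEN_ACTIONS = frozenset({"drop_database", "delete_volume"})


def proposes_forbidden(actions: Iterable[str]) -> bool:
    """Safety by set membership: no judge needed."""
    return any(a in FORBIDDEN_ACTIONS for a in actions)


def tool_call_well_formed(call: dict, required: dict) -> bool:
    """Minimal schema check: each required key is present with the right type."""
    return all(k in call and isinstance(call[k], t) for k, t in required.items())


def raw_agreement(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of items where the judge matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    p_o = raw_agreement(human, judge)
    n = len(human)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    p_e = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


human_labels = [True, True, False, False]
judge_labels = [True, True, False, True]
print(raw_agreement(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))   # 0.5
```

What counts as "solid" agreement is a threshold you have to pick for your own risk tolerance; the code only produces the number.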
Your best test cases are your failures
The best source of eval scenarios is production itself. Every failure a human catches becomes a permanent test case: capture the incident, record the correct answer, add it to the set. The next time a change might reintroduce that failure, the eval catches it. This is exactly how normal software accumulates regression tests, one bug at a time, and an eval set that isn't growing is one that isn't learning from production.
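Capturing a failure can be as plain as appending a line to a file that lives next to the code. The sketch below assumes a JSON Lines file and an `EvalCase` shape of my own invention; the incident text and file name are illustrative only.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    incident_input: str       # what the agent was given
    expected_diagnosis: str   # the answer a human confirmed
    source: str = "production_failure"


def load_cases(path: Path) -> list[EvalCase]:
    """Read every case from a JSON Lines file; a missing file means no cases."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]


def add_case(path: Path, case: EvalCase) -> bool:
    """Append a case unless its id is already recorded. Returns True if added."""
    if any(c.case_id == case.case_id for c in load_cases(path)):
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")
    return True


def demo(path: Path) -> int:
    """Record one illustrative failure twice; the set should grow only once."""
    case = EvalCase(
        case_id="cascade-001",
        incident_input="checkout 5xx spike following payments-db failover",
        expected_diagnosis="payments-db failover exhausted the connection pool",
    )
    add_case(path, case)
    add_case(path, case)  # duplicate id is ignored
    return len(load_cases(path))
```

Calling `demo(Path("eval_cases.jsonl"))` on a fresh file returns 1. Keeping the file in version control gives the eval set the same history as the prompts it guards.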
Start embarrassingly small
You don't need a complete suite; you need a harness that exists before your next deploy. Pick five inputs where you know what the agent should do, production failures first if you have them. Decide how to score correctness, safety and efficiency for each, using rules wherever possible. Run each scenario three to five times, record the rates, and that's your baseline. Next time you change a prompt or a model, run it again and compare.
The first time it catches a regression you'd otherwise have shipped, it has paid for itself many times over. A red number before a deploy instead of a complaint after one: that's the entire point, and it's the thing six weeks of vibes never gave anyone.
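The compare step might look like the sketch below. It assumes you have stored one pass rate per scenario; the scenario names and the `tolerance` parameter are illustrative, and treating a scenario that is missing from the new run as a rate of 0.0 is my own choice so that a dropped scenario cannot pass silently.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return {scenario: (baseline_rate, current_rate)} for every drop beyond tolerance."""
    regressions = {}
    for name, base_rate in baseline.items():
        # Assumption: a scenario absent from the new run counts as rate 0.0.
        now = current.get(name, 0.0)
        if base_rate - now > tolerance:
            regressions[name] = (base_rate, now)
    return regressions


baseline = {"cascade": 0.8, "disk_full": 1.0, "cert_expiry": 1.0}
after_prompt_change = {"cascade": 0.4, "disk_full": 1.0, "cert_expiry": 1.0}

print(find_regressions(baseline, after_prompt_change))
# {'cascade': (0.8, 0.4)}
```

With only three to five runs per scenario, one flipped run moves the rate a long way, so a nonzero `tolerance` may be worth setting before treating every drop as a regression.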
This post is drawn from chapter seven of my book on production AI agents. Next post: why a harness a human runs when they remember is just better vibes, and how to make evals physically block a bad deploy.