Your agent demo is not a product

The demo went great. The agent took a vague request, picked the right tools, chained six steps together, and produced an answer that made the room go quiet for a second. Somebody said "ship it" and only half meant it as a joke. You've been in that meeting. I've been in that meeting.

Here's the thing nobody says in that meeting: the distance between that demo and a production system is not a few sprints of hardening. It's most of the work, and it's work that looks nothing like the work that produced the demo.

What the pager teaches you

Imagine it's 02:47 on a Tuesday and your phone buzzes itself off the nightstand. P1. Payments success rate dropped from 99.2% to 71% in fifteen minutes. You're up, checking dashboards, recent deploys, errors by region, the support queue. No deploys in six hours. You open a trace from a failed checkout and see the payment provider returning a slow success: 200 OK in 4.1 seconds, against a 3-second timeout. Three minutes in, you have a hypothesis: the provider got slow, your timeout is too aggressive, and you're treating slow successes as failures. You push a config change, watch the success rate climb back through 95%, and you're in bed by 03:30.
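If the "slow success" failure mode feels abstract, here is a minimal sketch of it in Python. The function name, the outcome labels, and the 5-second replacement timeout are hypothetical, made up for illustration; only the 4.1-second response and the 3-second timeout come from the story.

```python
def classify_payment(status_code: int, latency_s: float, timeout_s: float) -> str:
    """Classify a provider call the way a client with a hard timeout sees it."""
    if latency_s > timeout_s:
        # The client gave up before the response arrived, so even a 200 OK
        # gets recorded as a failure on our side.
        return "failed_timeout"
    return "succeeded" if status_code == 200 else "failed_error"


# The slow success from the trace: 200 OK after 4.1 seconds, 3-second timeout.
before = classify_payment(200, latency_s=4.1, timeout_s=3.0)

# The same response after a hypothetical config change to a 5-second timeout.
after = classify_payment(200, latency_s=4.1, timeout_s=5.0)

print(before, after)
```

The provider did nothing different between the two calls. Only the client's patience changed, which is why a config change alone could move the success rate.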

Nothing about that incident required brilliance. It required a system you could interrogate. Traces that reconstruct what happened. A hypothesis you could test. An action you could take and watch. A way to know within minutes whether you were right.

That's the bar for production software: you'd put it on the pager. You expect it to handle traffic it has never seen, on a day it has never seen, against a downstream that just changed under it, and either do its job or fail in a way you can recover from.

Now hold your agent demo up against that bar.

The 88%

Across 2025 and 2026 enterprise surveys, roughly 88% of AI agent pilots never graduate to production. The main blockers cluster around evaluation gaps first, governance friction second, model reliability third.

The interesting thing about that number is not that it's grim. It's that the blockers are boring. Nobody's pilot died because the model wasn't smart enough to do the job in the demo. The pilots die because nobody can say with evidence how often the agent is right on real traffic. Because nobody can list the consequential actions it can take and who approves them. Because a run that crashes on step six leaves the world in a state nobody planned for.

Those are engineering problems, and old ones mostly. They just don't get solved by making the demo more impressive, which is unfortunately where most of the effort goes.

Five questions your agent has to answer

When I say production-grade, I mean an agent that has an answer for five questions. Not a vibe, an answer, the kind you could defend on a whiteboard with data.

Is it correct under realistic load? Not on the eval set you wrote in week two. On the inputs real users send, including the hostile and malformed ones. Correctness has to be measured, not asserted.

Is it recoverable after partial failure? A multi-step run that dies on step six can't leave you with a charge but no order. Either it resumes from a checkpoint, or it can be safely abandoned. There is no third option that ships.

Is it observable from the outside? When something goes wrong, someone who didn't write the agent has to be able to answer why without reading the source.

Is its cost predictable? An agent that costs more to run on a support ticket than the ticket is worth doesn't ship. Neither does one whose cost depends on user input in ways you can't model.

Is it governable? Are all the actions scoped? Is there an audit trail? Can each permission be revoked on its own? A binary on/off switch on a system that takes consequential actions in the world is not governance.

A demo answers none of these. A demo is a single trajectory through a system, on a friendly input, on a good day, witnessed by people who want it to work. That's not a knock on demos. You need them to get the project funded. The trap is mistaking the demo for an early version of the product. It isn't. It's just an argument that the product is worth building.
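Of the five questions, recoverability is the easiest to make concrete. What follows is a minimal sketch of checkpointed execution, assuming each step is a plain function over a JSON-serializable state dict. The names `run_with_checkpoints` and `Step`, and the choice of a local JSON file as the checkpoint store, are hypothetical stand-ins for illustration, not a recommendation.

```python
import json
from pathlib import Path
from typing import Callable

# A step takes the run's state and returns the updated state.
Step = Callable[[dict], dict]


def run_with_checkpoints(steps: list[tuple[str, Step]], checkpoint: Path) -> dict:
    """Run named steps in order, skipping any that an earlier run completed."""
    # Load progress from an earlier, interrupted run if there is one.
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"done": [], "state": {}}

    for name, step in steps:
        if name in saved["done"]:
            continue  # already completed, so do not repeat its side effects
        saved["state"] = step(saved["state"])
        saved["done"].append(name)
        # Persist after every step so a crash loses at most the step in flight.
        checkpoint.write_text(json.dumps(saved))
    return saved["state"]
```

Calling it again with the same checkpoint path after a crash resumes at the first step that never finished. Note the gap this sketch leaves open: a crash after a step's side effect but before the checkpoint write will repeat that step on resume, so steps like "charge the card" still need to be idempotent on their own.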

The posture that separates the 12%

Teams are shipping agents for support triage, code review, and incident response, all running on real traffic with real users. When you look at what those teams did differently, it's rarely a smarter model or a cleverer prompt.

The teams that stall treat the model as the product and the agent as a thin wrapper around it. If the model gets smarter, the agent gets better; if the agent fails, it's the model's fault.

The teams that ship treat the agent as a system, where the model is one component out of eight or so, alongside the orchestrator, the state store, the tool layer, the evals, the observability, the guardrails, and the rollout surface. Then they engineer all of it with the same discipline they'd apply to any other system that's allowed to wake them up at night.

That sort of work is unglamorous. It means spending weeks on an eval harness while a competitor demos something flashier. It means launching in shadow mode and letting the agent earn each new permission. It also happens to be the only thing that has worked.
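Shadow mode with earned permissions can sound more exotic than it is. Here is one way it might look, as a minimal sketch: every proposed action is written to an audit log, and only actions that have been explicitly granted are executed. The `ShadowGate` class and the action name in the usage lines are hypothetical, invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ShadowGate:
    """Logs every proposed action; executes only the ones that are granted."""

    granted: set[str] = field(default_factory=set)
    audit_log: list[dict] = field(default_factory=list)

    def submit(self, action: str, payload: dict, execute: Callable[[dict], None]) -> bool:
        """Record the proposal, run it if granted, and report whether it ran."""
        allowed = action in self.granted
        self.audit_log.append(
            {"action": action, "payload": payload, "executed": allowed}
        )
        if allowed:
            execute(payload)
        return allowed


gate = ShadowGate()  # starts with nothing granted, so everything is shadow-only
executed = []

# In shadow mode the proposal is logged but nothing happens in the world.
gate.submit("restart_service", {"name": "api"}, executed.append)

# After reviewing the log, a human grants that one action, and only that one.
gate.granted.add("restart_service")
gate.submit("restart_service", {"name": "api"}, executed.append)
```

Revoking is the reverse of granting: remove the action from `granted` and the agent drops back to proposing it. That per-action grain is the difference between this and a single on/off switch.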

Where this is going

I'm working on turning this argument into a book that builds a working SRE agent, one that handles incidents like the 02:47 page above. It includes the unglamorous parts: the eval harness, the deploy gates, and the rollout from shadow mode to acting on its own.

This post is the first in a series working through the ideas in it. Next up: the eight components every production agent needs, and the failure mode each one exists to prevent.

In the meantime, here's a cheap test you can run today. Take the agent you're building and hold it up against the five questions above. Score yourself one to five on each, where one means "no answer" and five means "I could prove it with data." If you're below three on more than two of them, you don't have a production system yet. You have a demo. The good news is that the path from one to the other is mostly known engineering, and that's exactly what this series is about.
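If you'd rather not do the arithmetic in your head, the scoring rule fits in a few lines of Python. The question keys and the function name are my own labels for this sketch; the threshold (below three on more than two questions) is the one from the paragraph above.

```python
QUESTIONS = ("correct", "recoverable", "observable", "cost_predictable", "governable")


def is_still_a_demo(scores: dict[str, int]) -> bool:
    """Return True if more than two of the five questions score below three."""
    missing = set(QUESTIONS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for question in QUESTIONS:
        if not 1 <= scores[question] <= 5:
            raise ValueError(f"{question} must be scored from 1 to 5")
    weak = [q for q in QUESTIONS if scores[q] < 3]
    return len(weak) > 2


# Example self-assessment with made-up numbers.
my_agent = {
    "correct": 2,
    "recoverable": 1,
    "observable": 4,
    "cost_predictable": 2,
    "governable": 3,
}
print(is_still_a_demo(my_agent))
```

Three of those made-up scores fall below three, so this agent would count as a demo under the rule.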