What your agent isn't allowed to do

Two teams at the same company build an agent on the same data warehouse. Both demo well. One is still running a year later. The other gets turned off in three weeks. The difference isn't the model or the prompt. It's where each team drew the boundary.

The first team builds the irresistible pitch: ask any question about the business, and the agent writes the SQL and answers. In the demo it's magic. In production, someone asks about "active customers" and the agent joins on a column that double-counts during plan changes. Nobody notices, because the number looks plausible. Someone asks about "this quarter" and gets calendar quarters while Finance uses fiscal ones, and two numbers diverge in a leadership meeting. The SQL is valid every time. The failure is that the agent's scope was "any question", and the warehouse has a thousand traps that a human analyst learns to avoid over years.

The second team's agent answers questions about fourteen defined metrics, each backed by a hand-written query the data team already trusts. The agent doesn't write SQL. It picks the right metric, fills in the parameters, and presents the result. Off the list? It says so and offers the closest matches. Less impressive in the demo. Still running a year later, because every answer is one the data team stands behind.
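Here is a minimal sketch of that selection step, under loud assumptions: the metric names, SQL text, and the resolve function are all hypothetical stand-ins, not the second team's actual code. The point it illustrates is that the model's output is only a metric name and parameter values, and anything off the list gets a refusal with suggestions rather than fresh SQL.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    sql: str                 # hand-written query the data team already trusts
    params: tuple[str, ...]  # the only values the agent is allowed to fill in


# Hypothetical registry; a real one would hold all fourteen metrics.
METRICS = {
    "active_customers": Metric(
        "active_customers",
        "SELECT COUNT(DISTINCT customer_id) FROM subscriptions "
        "WHERE status = 'active' AND as_of = :as_of",
        ("as_of",),
    ),
    "fiscal_quarter_revenue": Metric(
        "fiscal_quarter_revenue",
        "SELECT SUM(amount) FROM revenue WHERE fiscal_quarter = :fiscal_quarter",
        ("fiscal_quarter",),
    ),
}


def resolve(metric_name: str, params: dict[str, str]) -> dict:
    """Return a vetted query with bound parameters, or a refusal with suggestions."""
    metric = METRICS.get(metric_name)
    if metric is None:
        # Off the list: say so and offer the closest known metrics.
        closest = difflib.get_close_matches(metric_name, list(METRICS), n=3, cutoff=0.4)
        return {"status": "out_of_scope", "closest": closest}
    missing = [p for p in metric.params if p not in params]
    if missing:
        return {"status": "needs_params", "missing": missing}
    # Only declared parameters are passed through; the SQL text is never edited.
    bound = {p: params[p] for p in metric.params}
    return {"status": "ok", "sql": metric.sql, "params": bound}
```

Calling resolve("active_customer", {}) returns an out_of_scope result that suggests "active_customers", which is the "says so and offers the closest matches" behavior in code form.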

Capability-first vs reliability-first

Capability-first design asks "what could this agent do?" and pushes the boundary out as far as the model allows. Reliability-first asks "what can this agent do correctly, every time, in a way I can verify?" and draws the boundary there.

Reliability-first wins, and here's the blunt reason. A non-deterministic system's value is not its peak capability. An agent that's brilliant on 80% of inputs and confidently wrong on 20% has negative value when you can't tell which 20% you're in, because the wrong answers cost you the trust that made the right answers worth having. Narrow the boundary until the confidently-wrong rate inside it is near zero and the value compounds: people stop checking its work, and the agent graduates to autonomous operation. The capability-first agent never graduates, because nobody can stop checking.
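A toy calculation makes the "negative value" claim concrete. Every number here is an assumption chosen for illustration: a right answer is worth 1 unit, and an uncaught wrong answer costs 10 units (rework plus lost trust). The function name and its parameters are hypothetical.

```python
def net_value_per_answer(p_correct: float, value_right: float, cost_wrong: float) -> float:
    """Expected value of one answer when wrong answers go uncaught.

    All inputs are illustrative assumptions, not measurements.
    """
    return p_correct * value_right - (1.0 - p_correct) * cost_wrong


# Wide boundary: right 80% of the time, and nobody can tell which 20% is wrong.
wide = net_value_per_answer(0.80, value_right=1.0, cost_wrong=10.0)

# Narrow boundary: fewer questions in scope, but right 99% of the time.
narrow = net_value_per_answer(0.99, value_right=1.0, cost_wrong=10.0)

print(round(wide, 2), round(narrow, 2))  # -1.2 0.89
```

Under these assumed costs the 80% agent loses value on every answer, while the narrower agent comes out positive. The exact break-even point depends entirely on how expensive a wrong answer is in your setting.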

This also explains a big chunk of the 88% pilot failure rate from earlier in this series. The most-cited blocker is evaluation gaps, and an evaluation gap is usually a boundary problem underneath. You can't evaluate an agent whose scope is "any question", because you can't enumerate the inputs or define correct answers. Draw a boundary and the eval set becomes finite and gateable.
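Once the scope is a fixed list, an eval set can be an ordinary list of cases with a pass/fail gate in front of deployment. The sketch below assumes a hypothetical pick_metric callable standing in for the model call, made-up questions, and a gate that requires every case to pass; none of these names come from a real framework.

```python
from typing import Callable, Optional

# Hypothetical eval cases: (question, expected metric name), with None meaning
# the correct behavior is to refuse because the question is out of scope.
EVAL_CASES = [
    ("How many active customers do we have?", "active_customers"),
    ("What was revenue last fiscal quarter?", "fiscal_quarter_revenue"),
    ("Why is churn up in Germany?", None),
]

Picker = Callable[[str], Optional[str]]


def pass_rate(pick_metric: Picker, cases: list) -> float:
    """Fraction of cases where the picker chose the expected metric or refusal."""
    passed = sum(1 for question, expected in cases if pick_metric(question) == expected)
    return passed / len(cases)


def gate(pick_metric: Picker, cases: list, required: float = 1.0) -> bool:
    """Block the release unless the pass rate meets the required bar."""
    return pass_rate(pick_metric, cases) >= required


def keyword_picker(question: str) -> Optional[str]:
    """Stand-in for the model call so the sketch runs on its own."""
    q = question.lower()
    if "active customers" in q:
        return "active_customers"
    if "revenue" in q:
        return "fiscal_quarter_revenue"
    return None  # refuse anything else


print(gate(keyword_picker, EVAL_CASES))  # True
```

Note that the refusal case is part of the eval set. With an open-ended scope there is no way to write that third row, because there is no defined edge to test.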

Core, frontier, wilderness

Where does the boundary go? Not where the model stops producing an answer; it'll produce one for almost anything. It goes where the model stops producing a correct answer you can verify. Sort your agent's inputs into three zones:

Core: reliably correct and checkable. The fourteen metrics, the refund under fifty dollars, the read-only query. Runs autonomously.

Frontier: often correct, but the failure is silent or hard to verify. You can't tell a good answer from a plausible bad one without doing the work yourself. The agent proposes, a human approves.

Wilderness: regularly wrong, or catastrophic when wrong. Out of scope entirely, and ideally the agent recognizes it's there and says so.
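At runtime the three zones can be enforced with a small router. This is a sketch under assumptions: the capability names and the returned action strings are hypothetical, and treating anything unlisted as wilderness is my design choice here, not something the zones themselves dictate.

```python
from enum import Enum


class Zone(Enum):
    CORE = "core"
    FRONTIER = "frontier"
    WILDERNESS = "wilderness"


# Hypothetical capability-to-zone table, maintained by humans, not by the model.
ZONES = {
    "lookup_metric": Zone.CORE,
    "refund_under_50": Zone.CORE,
    "draft_new_query": Zone.FRONTIER,
    "drop_table": Zone.WILDERNESS,
}


def route(capability: str) -> str:
    """Decide what happens to a request based on its zone."""
    # Anything not explicitly listed is treated as wilderness (default deny).
    zone = ZONES.get(capability, Zone.WILDERNESS)
    if zone is Zone.CORE:
        return "execute"               # runs autonomously
    if zone is Zone.FRONTIER:
        return "propose_for_approval"  # agent proposes, a human approves
    return "refuse_and_escalate"       # out of scope; say so and hand off


print(route("lookup_metric"), route("draft_new_query"), route("something_new"))
# execute propose_for_approval refuse_and_escalate
```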

One warning: don't sort by capability demonstration. "It answered this hard question correctly in testing" is a sample of size one from exactly the part of the distribution you need a large sample from. Frontier competence looks like core competence in a demo and reveals itself at production volume.
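To put a number on "sample of size one": if you run n independent test cases and see zero failures, the most you can claim at a given confidence level is an upper bound on the true failure rate. The sketch below uses the standard zero-failure binomial bound, and it assumes the test cases are independent and drawn from the same distribution as production traffic, which a hand-picked demo question is not. The function name is hypothetical.

```python
def failure_rate_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after n_trials with zero failures.

    If the true failure rate is p, the chance of seeing no failures in n
    independent trials is (1 - p) ** n. Setting that equal to 1 - confidence
    and solving for p gives the bound returned here.
    """
    if n_trials < 1:
        raise ValueError("need at least one trial")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


for n in (1, 30, 300):
    print(n, round(failure_rate_upper_bound(n), 3))
# 1 0.95
# 30 0.095
# 300 0.01
```

One correct answer in testing is consistent with a failure rate as high as 95%. Getting the bound down to about 1% takes roughly 300 clean runs, which is why frontier and core look identical in a demo.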

Making the task smaller is a design move

When an agent fails on the frontier, the instinct is a stronger model or a better prompt. Often the cheaper move is to make the task smaller: narrow the input domain, decompose the task and keep the reliable parts, replace generation with selection (the second team's move: pick from vetted queries instead of writing them), lower the stakes so a wrong answer is recoverable, or simply refuse at the boundary and route to a human.

None of these require a better model. Scope reduction is under your control and compounds. Model improvement is somebody else's roadmap.

And the most aggressive version: take the agent out of steps that don't need it. If a junior engineer with clear instructions could script a step to work correctly every time, write the script. Save the model for the irreducible judgment. The best production agents are mostly boring, reliable code with a small, sharp core of model-driven judgment. The agent is the seasoning, not the meal.

Try it on your agent

List every capability, one row each, and score three things: how often it's actually correct (be honest about whether that's a number or a guess), whether you'd catch a wrong answer before it caused harm, and how bad an uncaught one would be. Zone each row. Core runs autonomously, frontier gets a human in the loop, wilderness comes out of scope with a written escalation path.
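The audit can live in a spreadsheet, but a few lines of code keep the zoning rule explicit. Everything specific in this sketch is an assumption: the field names, the severity labels, the 0.99 and 0.80 thresholds, and the rule that an unmeasured capability can never be core. Pick your own values; the useful part is writing the rule down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capability:
    name: str
    measured_accuracy: Optional[float]  # None means "this is a guess, not a number"
    caught_before_harm: bool            # would a wrong answer be caught in time?
    severity: str                       # "low", "medium", or "catastrophic"


def zone(cap: Capability, core_min: float = 0.99, frontier_min: float = 0.80) -> str:
    """Assign a zone from the three scores. Thresholds are illustrative."""
    if cap.severity == "catastrophic":
        return "wilderness"  # catastrophic when wrong: out of scope
    if cap.measured_accuracy is not None and cap.measured_accuracy < frontier_min:
        return "wilderness"  # regularly wrong: out of scope
    if (
        cap.measured_accuracy is not None
        and cap.measured_accuracy >= core_min
        and cap.caught_before_harm
    ):
        return "core"        # measured, reliable, and checkable
    return "frontier"        # unmeasured, or silent failures: human in the loop


# Hypothetical rows for illustration.
rows = [
    Capability("lookup_metric", 0.998, True, "low"),
    Capability("draft_new_query", None, False, "medium"),
    Capability("bulk_delete", 0.999, True, "catastrophic"),
]
for row in rows:
    print(row.name, zone(row))
# lookup_metric core
# draft_new_query frontier
# bulk_delete wilderness
```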

Then the question that catches most designs: can the agent recognize the boundary from the inside? A boundary the agent can't detect isn't a boundary. It's a hope.

This post is drawn from chapter three of my book on production AI agents, where this boundary gets drawn for a working SRE agent and then enforced in code across the rest of the build. Next post: making the in-scope work durable, or why your agent should survive its own death.

Quoth the Agent: Nevermore

The most important design decision is what the agent is not allowed to do!