Death by a thousand tokens

This failure looks exactly like success right up until someone does the arithmetic.

A team builds an agent. It works. The evals are green, the traces are clean, the demo lands, the pilot goes out to a few hundred users. Then someone multiplies. Each session, a full multi-step interaction, costs around four dollars in model calls. At a few hundred sessions a day, that's a line item nobody scrutinizes. But the plan is to roll it out to the whole customer base.. tens of thousands of sessions a day, four dollars each, every day, forever. The triumph at pilot scale is an economic impossibility at production scale, and the gap was invisible until someone reached for a calculator.

The agent isn't bad. It's unaffordable, which is its own way of not being production-ready. And the fix isn't to make it worse, it's to make it cheaper at the same quality. Most of that comes from three moves that never touch the agent's logic.

Find where the tokens actually go

You can't engineer a cost you haven't profiled, and your traces already hold the profile, because you recorded the token count on every model call. Reading them almost always shows the cost is concentrated, not spread evenly, and the concentration tells you which lever to pull.

The dominant cost in most agents isn't the number of model calls, it's the size of the context multiplied by the number of calls. An agent accumulates context as it works.. the conversation, the evidence, the steps so far, and tends to send the whole thing on every step. So the tenth step sends ten steps' worth of context, and a long investigation gets expensive in a way that feels quadratic. The late steps dominate, because they carry the most context.

Two other concentrations show up. Using your most capable model for every step, including the trivial ones, is pure waste on the easy steps. And the runaway, the session that goes long because the agent got lost and looped, is rare per-session but lives in the tail of the cost distribution. Context size points at caching. Model choice points at routing. Runaways point at budgets.

Cache the repeated context

The biggest lever for most agents is prompt caching because it attacks the dominant cost directly. Across the steps of one run most of the context is identical.. the system prompt, the tool definitions, the incident description, the evidence so far. Sending it fresh every time means paying to reprocess the same tokens over and over. Prompt caching lets the model cache a context prefix, so later calls pay a fraction for the cached part and full price only for the new tokens. That's often the whole difference between the four-dollar agent and a sub-dollar one, with zero change to what the agent does.

The catch: caching works on prefixes so you have to order the context stable-to-volatile. System prompt and tool definitions first since they never change, then the incident context, then the accumulating evidence, then this step's new content last. Put something volatile near the front and you break the cache for everything after it. Small design constraint large payoff easy to get wrong by accident.

Route the easy steps to a cheap model

The steps of a run aren't equally hard so they don't all need the same model. The judgment, deciding the next move, forming a diagnosis from ambiguous evidence, is where the expensive model earns its price. But formatting a result, extracting a field, a routine classification.. a much cheaper model nails those identically, and paying premium prices for them is waste, often a large slice of the bill because the easy steps are numerous.

The reliable default is structural routing: clearly-easy step types go to the cheap model, judgment steps stay on the expensive one. Worth noticing, if you can't cleanly separate judgment from mechanics to route them, that's usually a sign your scope was never properly bounded. A routing problem can be a boundary problem in disguise.

Give every session a budget with teeth

Caching and routing lower the average. Budgets bound the worst case. The four-dollar agent had cost metrics, what it lacked was a ceiling that stopped the twenty-dollar runaway while it was happening. A real budget is checked as the agent spends not reported after. As the agent nears the ceiling it wraps up with its best current hypothesis, or escalates to a human, rather than spending through the limit. "Here's what I found within budget, a human should take it from here" beats both a session killed mid-step and a surprise on the invoice.

One more dial.. where a human is waiting favor latency even at higher cost. Where nothing is waiting favor cost. Most agents have both kinds of interaction and need both settings.

Profile from your traces, order context stable-to-volatile, route the easy steps to cheap models and put an enforced ceiling on each session. None of it changes what the agent concludes. It just makes the agent shippable instead of excellent-and-unaffordable.

This post is drawn from chapter ten of my book on production AI agents. Next post: the hostile log line, and the security surface agents have that ordinary services don't.