The trace solved it, not the intelligence

Go back to the 02:47 incident that opened this series. The thing that let the on-call engineer resolve it in forty-four minutes wasn't intelligence. It was a trace. They opened a single failed checkout, saw the payment call return a slow success inside an aggressive timeout, and the whole shape of the incident fell out of that one reconstructed request.

Your agent needs the same thing for itself. When it fails on step six of a long investigation, the only way to answer why is to reconstruct what it did, step by step, the way the engineer reconstructed the checkout. Without that, every agent failure is a shrug.

The four-hour shrug

Every team running an agent in production hits this. The agent does something wrong. Someone goes to investigate and discovers they can't. The record of what the agent did is a handful of log lines that say "calling tool", "got response", "decision made". None of it captures the content: which tool, what arguments, what came back, which decision and why. The reasoning that connected the steps lived in the model's context at runtime and was never written down.

The investigation that should take ten minutes takes four hours of reconstruction from fragments, and often ends with "we think it was something around here" instead of a root cause. Worse, a failure you can't reconstruct is a failure you can't capture as an eval scenario, which breaks the failures-become-tests loop from the last two posts at the first step. A poorly instrumented agent makes its own failures unlearnable.

What a real trace contains

The fix is to capture every run completely from the start. Observability you add after a failure can't reconstruct the failure that prompted you to add it. "Completely" means a few things people routinely skimp on:

The inputs, exactly. Not a summary but the actual trigger, because failures hide in the specific thing a summary smooths over.

Every model call in full: the complete prompt as sent and the complete response as received. This is the single most skimped piece and the most important. A trace that records "model decided to check deploys" instead of the actual prompt and completion has thrown away the one thing you need when the decision was wrong.

Every tool call with its exact arguments and full result, plus whether it succeeded, failed, fell back, or came back partial. And the timing and token cost of each step, which are both raw material for cost work and failure signals in their own right.
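Here is roughly what that record could look like as a data structure. This is a minimal sketch, not a schema from any particular tracing library: every class and field name below (RunTrace, ModelCall, ToolCall, ToolStatus and their fields) is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any


class ToolStatus(str, Enum):
    """The four outcomes worth telling apart for a tool call."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    FELL_BACK = "fell_back"
    PARTIAL = "partial"


@dataclass
class ModelCall:
    prompt: str          # the complete prompt as sent, not a summary
    response: str        # the complete response as received
    duration_ms: float
    input_tokens: int
    output_tokens: int


@dataclass
class ToolCall:
    tool: str
    arguments: dict[str, Any]   # exact arguments, not a description of them
    result: Any                 # full result as returned
    status: ToolStatus
    duration_ms: float


@dataclass
class RunTrace:
    run_id: str
    trigger: dict[str, Any]     # the raw input that started the run
    model_calls: list[ModelCall] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)

    def to_json(self) -> str:
        # ToolStatus mixes in str, so it serializes as its plain string value.
        return json.dumps(asdict(self), sort_keys=True)
```

The point of the types is what they don't offer: there is no "summary" field to fall back on, so the lazy path and the complete path are the same path.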

The discipline is completeness over convenience. Summaries always drop the one detail the next failure needs. One important caveat: redact secrets and customer data at the boundary, as you write the trace. The trace stays complete in structure, with sensitive values masked. Completeness means never dropping the detail that explains a decision, not hoarding secrets you'd never want in a log.
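Redacting at the boundary can be as simple as a function every value passes through on its way into the trace. The sketch below keeps the shape of the data and masks only the sensitive values. The key list and the email pattern are assumptions for illustration; a real deployment would need its own list and likely more patterns.

```python
import re
from typing import Any

# Hypothetical list of field names whose values should never reach a trace.
SENSITIVE_KEYS = {"authorization", "api_key", "password", "token", "card_number"}
# A deliberately simple email pattern, used here as one example of customer data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MASK = "[REDACTED]"


def redact(value: Any, key: str = "") -> Any:
    """Return a copy of value with sensitive parts masked and structure intact."""
    if key.lower() in SENSITIVE_KEYS:
        return MASK                      # mask the whole value under a sensitive key
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL.sub(MASK, value)    # mask emails embedded in free text
    return value                         # numbers, booleans, None pass through
```

Because the keys and nesting survive, you can still see that a tool call sent an Authorization header and to which URL. You just can't see the token.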

Capture the why, not just the what

A trace of what the agent did is necessary but not sufficient, because agent failures are frequently failures of reasoning, not mechanics. The tools all worked, every call succeeded, and the agent still reached the wrong conclusion.

A mechanical trace of that is baffling: a sequence of successful steps ending in a wrong answer. A semantic trace, one that also records the agent's hypothesis and its reason for each step, shows the agent forming a wrong hypothesis at step two, then gathering evidence that fit it while ignoring the signal that contradicted it. That you can act on. Getting it costs the agent a few tokens to state its thinking, and it has a side benefit: an agent that has to articulate its hypothesis tends to reason more carefully, the same way a person explaining their thinking catches their own errors.
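One way to get the hypothesis into the trace is to make the agent declare it before each step, and refuse steps that don't. The sketch below assumes the agent emits a small JSON object before acting. The StepIntent type, its three fields, and the JSON shape are all hypothetical, chosen only to illustrate the idea.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StepIntent:
    hypothesis: str   # what the agent currently believes is going on
    action: str       # the tool or model call it wants to make next
    reason: str       # why this action tests or refines the hypothesis


def parse_step_intent(raw: str) -> StepIntent:
    """Parse the agent's declared intent; reject it if the 'why' is missing."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("step intent must be a JSON object")
    required = ("hypothesis", "action", "reason")
    # A field counts as missing if it is absent or blank.
    missing = [k for k in required if not str(data.get(k) or "").strip()]
    if missing:
        raise ValueError(f"step intent missing fields: {missing}")
    return StepIntent(*(str(data[k]) for k in required))
```

Stored next to the call it preceded, a StepIntent is what turns "seven successful tool calls" into "a wrong hypothesis at step two".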

One more structural point: make the trace a tree, not a flat list. The run is the root span, each step a child, each model and tool call a span under that. "Why did it fail on step six" becomes opening the step-six span and reading its subtree instead of scanning a forty-step log. Build on OpenTelemetry so your agent's traces live in the same place as the rest of your systems. There's a nice symmetry in an SRE agent: it reads service traces to do its job and emits its own traces so we can trust it. Same stack both directions.
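To make the tree shape concrete, here is a standard-library stand-in. It is not OpenTelemetry and doesn't use its API: the Span class, its methods, and the span names are invented for illustration, and in a real system each node would be an actual OpenTelemetry span.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    kind: str                                   # "run", "step", "model_call", "tool_call"
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, kind: str, **attributes) -> "Span":
        """Create a span nested under this one and return it."""
        span = Span(name, kind, attributes)
        self.children.append(span)
        return span

    def walk(self, depth: int = 0) -> Iterator[tuple[int, "Span"]]:
        """Yield (depth, span) for this span and everything beneath it."""
        yield depth, self
        for c in self.children:
            yield from c.walk(depth + 1)

    def find(self, name: str) -> Optional["Span"]:
        """Return the first span with this name in the subtree, or None."""
        return next((s for _, s in self.walk() if s.name == name), None)


def render(span: Span) -> str:
    """Indented outline of a subtree, one span per line."""
    return "\n".join(f"{'  ' * d}{s.kind}: {s.name}" for d, s in span.walk())


# Run is the root, steps are children, calls hang under their step.
run = Span("investigation", "run")
step6 = run.child("step-6", "step")
step6.child("choose-next-action", "model_call")
step6.child("list_deploys", "tool_call", status="failed")
```

With that shape, answering the step-six question is render(run.find("step-6")), which prints only that step and the calls under it.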

Try it cold

Here's the test, and most teams fail it the first time. Pick a run your agent did yesterday that you weren't watching. Using only what your system recorded, reconstruct what it did and why, end to end. For each step, can you see the full input, the full output, the tool call and its result, the timing, the cost, and the hypothesis behind the decision? Then pick the most surprising thing the run did and explain it from the trace alone, without rerunning and without reading the source.
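You can automate the first half of that test. The sketch below assumes each recorded step is a dictionary and checks it against the list above. The field names are hypothetical and will differ in your trace format, and a step that legitimately made no tool call would need an explicit marker rather than an empty field.

```python
# Hypothetical field names, one per item in the checklist above.
REQUIRED_FIELDS = (
    "input", "output", "tool_call", "tool_result",
    "duration_ms", "cost_usd", "hypothesis",
)
EMPTY = (None, "", [], {})


def missing_fields(step: dict) -> list[str]:
    """Names of required fields that are absent or empty in one step."""
    return [name for name in REQUIRED_FIELDS if step.get(name) in EMPTY]


def audit_run(steps: list[dict]) -> dict[int, list[str]]:
    """Map step number (1-based) to its missing fields, for steps with holes."""
    report = {}
    for number, step in enumerate(steps, start=1):
        missing = missing_fields(step)
        if missing:
            report[number] = missing
    return report
```

An empty report only tells you the fields exist. Whether they are enough to explain the most surprising thing the run did is still a question you have to answer by reading.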

If you're guessing or rerunning, you just found a hole in your instrumentation, on a run that didn't matter instead of during an incident that does.

This post is drawn from chapter nine of my book on production AI agents. Next post: the number that's been sitting in every one of those traces, unread: cost.