Your agent should survive its own death

Here's the agent control flow almost everyone writes first.

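def run_agent(task, model):
    state = initial_state(task)            # lives only in this process's memory
    while not done(state):
        next_step = model.decide(state)    # ask the model what to do next
        result = execute(next_step)        # perform it; side effects happen here
        state = state + result             # append the outcome
    return state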

Look at the state, ask the model what to do next, do it, append the result, repeat until done. A while loop with a model call inside. It's correct as a description of what an agent does. It's also a single point of failure wearing a while loop.

Walk through what production does to it. The process running the loop is, well, a process. It gets deployed over, OOM-killed, or crashes because some step threw an exception nobody caught. When that happens after step six of a ten-step task, the in-memory state evaporates. The first six steps ran. They had effects. Maybe they posted a message, maybe they charged a card. None of that is undone, but the loop has no memory that it happened. Restart the task and it does those six steps again.

None of this shows up in a demo, because the demo runs to completion in one process on one happy path. It shows up on day three, when a deploy rolls mid-run, a customer gets charged twice, and you discover your agent's control flow has none of the properties your payment system spent a decade building.

The fix is not a more careful loop. It's two structural moves.

Split the decider from the doer

The single most important structural decision in agent architecture: separate the part that decides what to do from the part that does it.

The orchestrator owns control flow. It holds task state, decides the next step, and records the outcome once the step has happened. The important point is that it never touches the outside world. It doesn't run the query or charge the card; it decides that those things should happen.

The executor runs one step. It gets a decision ("run this query"), performs it against the real system, handles the timeout and the retry, returns the outcome. It's deliberately stateless about the overall task.

Why bother? Because the two halves have opposite needs. The orchestrator needs to be durable and deterministic. The executor does messy side-effecting I/O against systems that fail. Fuse them and you get a component that must be durably replayable and does non-replayable side effects, a contradiction you can't engineer your way out of. Split them and each half can be built right.
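Here is a minimal sketch of the split, under some loud assumptions: the names Decision, Orchestrator, Executor and drive are made up for illustration, a fixed plan stands in for the model call, and the history lives in memory rather than in anything durable. The only thing it tries to show is the boundary: the orchestrator produces and records data, the executor is the one place a tool actually runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Decision:
    """A proposed action. Pure data: creating one has no side effects."""
    tool: str
    args: dict


class Orchestrator:
    """Owns task state and control flow. Never touches the outside world."""

    def __init__(self, plan: List[Decision]):
        # Assumption: a fixed plan stands in for the model call.
        self._plan = plan
        self.history: List[tuple] = []

    def next_decision(self) -> Optional[Decision]:
        if len(self.history) < len(self._plan):
            return self._plan[len(self.history)]
        return None

    def record(self, decision: Decision, outcome: object) -> None:
        self.history.append((decision, outcome))


class Executor:
    """Runs exactly one decision. Knows nothing about the overall task."""

    def __init__(self, tools: Dict[str, Callable[..., object]]):
        self._tools = tools

    def run(self, decision: Decision) -> object:
        return self._tools[decision.tool](**decision.args)


def drive(orchestrator: Orchestrator, executor: Executor) -> List[tuple]:
    # The loop still exists, but deciding and doing are now separate parts.
    while (decision := orchestrator.next_decision()) is not None:
        orchestrator.record(decision, executor.run(decision))
    return orchestrator.history


# Hypothetical tool that records each query it was asked to run.
calls = []
tools = {"run_query": lambda sql: calls.append(sql) or f"rows for {sql}"}
plan = [Decision("run_query", {"sql": "SELECT 1"})]
history = drive(Orchestrator(plan), Executor(tools))
```

Timeouts and retries would live inside Executor.run, which is why the orchestrator can stay free of I/O.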

There's a security bonus too! In an agent, "decide" is a model call and "do" is a tool call. With the split the model only proposes, and a separate controllable component executes. Without it every action is one hallucination away from happening.
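One way that control point might look, as a sketch only: an executor that checks each proposed tool against an allowlist before running it. GatedExecutor, ActionRejected and both tools are hypothetical names, and a real policy would likely look at arguments and context too, not just the tool name.

```python
from typing import Callable, Dict, Set


class ActionRejected(Exception):
    """Raised when the model proposes something the policy does not allow."""


class GatedExecutor:
    def __init__(self, tools: Dict[str, Callable[..., object]], allowed: Set[str]):
        self._tools = tools
        self._allowed = allowed

    def run(self, tool: str, args: dict) -> object:
        # The model's proposal is only data until this check passes.
        if tool not in self._allowed:
            raise ActionRejected(f"tool {tool!r} is not permitted")
        return self._tools[tool](**args)


# Hypothetical tools: one harmless, one destructive.
tools = {
    "read_logs": lambda service: f"last 100 lines of {service}",
    "drop_table": lambda name: f"dropped {name}",
}
executor = GatedExecutor(tools, allowed={"read_logs"})

ok = executor.run("read_logs", {"service": "api"})
try:
    executor.run("drop_table", {"name": "users"})
    blocked = False
except ActionRejected:
    blocked = True
```

The model can still propose drop_table. It just can't make it happen.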

Make the orchestrator durable

The second move has a name in the infra world: durable execution. Every decision and every step result gets appended to a durable log before anything proceeds. The workflow's state is the result of replaying that log. When the process dies, a new one replays the log, reconstructs the exact state at the moment of death, and continues from step seven. Steps one through six are not re-executed, because the log says they happened and what they returned.
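To make replay concrete, here is a toy version, not a real engine. It assumes a JSON-lines file is the durable log, that steps are plain strings, and that a raised exception stands in for process death. run_workflow and the step names are invented for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def run_workflow(log_path: Path, steps, execute):
    """Replay recorded results, then execute only the steps the log lacks."""
    results = []
    if log_path.exists():
        results = [json.loads(line) for line in log_path.read_text().splitlines()]
    for step in steps[len(results):]:
        result = execute(step)
        with log_path.open("a") as log:
            log.write(json.dumps(result) + "\n")
            log.flush()
            os.fsync(log.fileno())  # on disk before the next step starts
        results.append(result)
    return results


# Demo: the "process" dies at step three, then a fresh run resumes.
executed = []
fail_once = {"armed": True}


def execute(step):
    if step == "s3" and fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated process death")
    executed.append(step)
    return {"step": step, "status": "ok"}


log_path = Path(tempfile.mkdtemp()) / "workflow.log"
steps = ["s1", "s2", "s3", "s4"]
try:
    run_workflow(log_path, steps, execute)
except RuntimeError:
    pass  # the first run is gone; only the log survives
final = run_workflow(log_path, steps, execute)
```

The second call never re-runs s1 or s2. It reads their results from the log and picks up at s3.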

Temporal is the mature option here (alternatives include Inngest, Restate and DBOS), and you can hand-roll a minimal version with a workflow table, a step-results table and a worker loop. The engine matters less than the discipline: no clock reads, no randomness, no I/O in the orchestrator. All of that goes through the executor and gets recorded.
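A hand-rolled version could start roughly like this. It is a sketch with SQLite from the Python standard library, and the table layout, column names and function names are my assumptions, not a schema any of the engines above use. Results are keyed by (workflow_id, step_index), so a restarted worker finds them instead of redoing the step.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE workflows (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'running'
);
CREATE TABLE step_results (
    workflow_id TEXT NOT NULL REFERENCES workflows(id),
    step_index INTEGER NOT NULL,
    result TEXT NOT NULL,
    PRIMARY KEY (workflow_id, step_index)
);
"""


def run_step(db, workflow_id, step_index, fn):
    """Return the recorded result for this step, or run fn and record it."""
    row = db.execute(
        "SELECT result FROM step_results WHERE workflow_id = ? AND step_index = ?",
        (workflow_id, step_index),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    result = fn()
    with db:  # commit the result before the workflow moves on
        db.execute(
            "INSERT INTO step_results (workflow_id, step_index, result) "
            "VALUES (?, ?, ?)",
            (workflow_id, step_index, json.dumps(result)),
        )
    return result


def worker_loop(db, steps):
    """Pick up every running workflow and drive it to completion."""
    running = db.execute(
        "SELECT id FROM workflows WHERE status = 'running'"
    ).fetchall()
    for (workflow_id,) in running:
        for index, fn in enumerate(steps):
            run_step(db, workflow_id, index, fn)
        with db:
            db.execute(
                "UPDATE workflows SET status = 'done' WHERE id = ?", (workflow_id,)
            )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
with db:
    db.execute("INSERT INTO workflows (id) VALUES ('wf-1')")
    # Pretend an earlier worker finished step 0 before it died.
    db.execute(
        "INSERT INTO step_results VALUES ('wf-1', 0, ?)", (json.dumps("fetched"),)
    )

calls = []
steps = [
    lambda: calls.append("fetch") or "fetched",
    lambda: calls.append("summarize") or "summary",
]
worker_loop(db, steps)
status = db.execute("SELECT status FROM workflows WHERE id = 'wf-1'").fetchone()[0]
```

Only the second step runs here. A real worker would also need leasing so two workers don't pick up the same workflow, which this sketch leaves out.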

Which raises the obvious objection: the model is the least deterministic thing in the system, so how does it live inside a deterministic replay? You record its answers. The first run consults the oracle, every replay reads the recording. This also means a crashed task resumes with the agent's earlier reasoning intact instead of paying the model to re-decide and possibly diverge. You don't make the model deterministic. You record what it said and replay the recording.
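In code, the record-and-replay idea can be as small as a wrapper around the model call. This is a sketch under assumptions: RecordedModel is a made-up name, the recording is a plain list where a real system would use the durable log, and fake_model is a stand-in that deliberately answers differently on every call.

```python
import itertools


class RecordedModel:
    """Answers from the recording while it lasts, then asks the real model."""

    def __init__(self, model_call, recording):
        self._model_call = model_call
        self._recording = recording  # assumption: persisted durably in practice
        self._cursor = 0

    def decide(self, state):
        if self._cursor < len(self._recording):
            answer = self._recording[self._cursor]  # replay: no model call
        else:
            answer = self._model_call(state)        # first time: ask and record
            self._recording.append(answer)
        self._cursor += 1
        return answer


# A stand-in model that never gives the same answer twice.
counter = itertools.count(1)


def fake_model(state):
    return f"answer-{next(counter)}"


recording = []
first_run = RecordedModel(fake_model, recording)
before_crash = [first_run.decide("s0"), first_run.decide("s1")]

# New process, same recording: the first two decisions are replayed.
resumed = RecordedModel(fake_model, recording)
after_restart = [resumed.decide("s0"), resumed.decide("s1"), resumed.decide("s2")]
```

After the restart the first two answers match the originals exactly, and the model is only consulted for the third.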

The double charge

Durable execution alone doesn't prevent the double charge, and it's worth seeing exactly why.

The orchestrator decides "charge the card". The executor sends the charge, the provider processes it and the card is charged, but the response gets lost on the way back: network drop, timeout, process death, take your pick. From the orchestrator's view, the step never recorded a success. So on retry it decides "charge the card" again, and now the customer has paid twice.

When a call times out, you genuinely do not know whether it succeeded. Any retry strategy that assumes a timeout means failure will double-act whenever the timed-out call actually succeeded downstream.

The answer is idempotency. Give every side-effecting action a key derived from durable facts (the workflow ID and the step position), never from a fresh UUID or a timestamp. The same logical step always produces the same key, so the downstream system recognizes the retry as a duplicate and returns the first result instead of acting again. The customer pays once no matter how many times the executor retries. And notice the key is a natural artifact of the split: the orchestrator already knows the workflow ID and step index. Idempotency isn't bolted on, it falls out of the architecture.
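A sketch of both halves of that contract. idempotency_key and FakePaymentProvider are illustrative names, the provider is an in-memory stand-in rather than any real payment API, and hashing the pair with SHA-256 is just one reasonable way to get a stable key.

```python
import hashlib


def idempotency_key(workflow_id: str, step_index: int) -> str:
    """Same workflow and step position always yield the same key."""
    return hashlib.sha256(f"{workflow_id}:{step_index}".encode()).hexdigest()


class FakePaymentProvider:
    """Stand-in for a downstream system that deduplicates by key."""

    def __init__(self):
        self._seen = {}
        self.charges = []

    def charge(self, amount_cents: int, key: str) -> dict:
        if key in self._seen:
            return self._seen[key]  # duplicate: return the first result
        receipt = {
            "charge_id": f"ch_{len(self.charges) + 1}",
            "amount_cents": amount_cents,
        }
        self.charges.append(receipt)
        self._seen[key] = receipt
        return receipt


provider = FakePaymentProvider()
key = idempotency_key("wf-42", 6)

# First attempt: the charge goes through, but pretend the response is lost.
lost_response = provider.charge(1999, key)

# Retry after the timeout: same workflow, same step, so the same key.
retry_response = provider.charge(1999, key)
```

The retry gets back the original receipt and the provider holds exactly one charge. Swap the key for a fresh UUID per attempt and the same two calls produce two charges.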

Four questions for your agent

Run your agent or your design through these:

One, where does your state live? If the answer is "in memory", a crash loses it.

Two, what happens when a side-effecting step retries? If the answer is "the effect happens twice", you have a double charge waiting for a timeout.

Three, can the model's output directly cause a side effect with no separate executor in between? If yes, your decider and doer are fused, and both durability and security get harder.

Four, what's your resumable unit? If it's "the whole task", a crash redoes the whole task. Push it down to individual tool and model calls.

Answer all four cleanly and your control flow will survive production. Any uncomfortable answer is the thing to fix before the agent takes a single consequential action in front of a real user.

This post is drawn from chapter four of my book on production AI agents, where this gets built into a working SRE agent you can kill mid-investigation and watch resume. Next post: the eval problem, or why a confident diagnosis and a correct one look identical until you hold the ground truth.