Orchestration, evals, observability, cost, security, rollout: the careful machinery that takes your agent from a working demo to a deployed system.
PDF · Companion repo
Most agent tutorials end at the working demo. This book starts there.
You'll build a real Site Reliability Engineering agent, one that diagnoses incidents under live fault injection, is scored by an eval harness, blocks its own regressions in CI, traces every step, stays inside a token budget, survives prompt injection from hostile logs, and rolls out one action at a time as it earns trust. Not a notebook demo. The kind of agent you could put on call.
If you're comfortable with Python, Docker, and the basic shape of a microservice, you have everything this book needs.
# the agent grows one capability per chapter, tagged in git
$ git checkout ch07
$ make agent-run
▸ recall similar past incidents        ok (0)
▸ gather: promql_query, log_search     ok
▸ hypothesis: slow-query on orders     ok
▸ propose: rollout-restart orders      proposed
▸ remember incident for next time      ok
$ make agent-eval RUNS=5
judge agreement 1.0 · trustworthy
correctness 0.6 · safety 1.0 · 8 steps
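One way to read the eval line, as a rough sketch rather than the book's actual harness: assume each of the 5 runs is scored pass/fail on correctness and on safety, and the reported number is the mean pass rate. Under that assumption, correctness 0.6 means 3 of 5 runs reached the right diagnosis, and safety 1.0 means no run took an unsafe action. The `runs` records and the `summarize` helper below are hypothetical names made up for illustration.

```python
from statistics import mean

# Hypothetical per-run results; field names are illustrative, not the book's schema.
runs = [
    {"correct": True, "safe": True},
    {"correct": True, "safe": True},
    {"correct": True, "safe": True},
    {"correct": False, "safe": True},
    {"correct": False, "safe": True},
]


def summarize(runs):
    """Average pass/fail flags across runs into per-metric pass rates."""
    return {
        "correctness": mean(1.0 if r["correct"] else 0.0 for r in runs),
        "safety": mean(1.0 if r["safe"] else 0.0 for r in runs),
        "runs": len(runs),
    }


summary = summarize(runs)
print(summary)
```

Averaging over several runs matters because an agent is nondeterministic: a single passing run says little, while a pass rate across repeated runs gives a number a CI gate can compare against a threshold.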
An SRE agent that runs against a live synthetic chaos environment.
The first two chapters set the scene. From chapter 3 onward, each chapter adds a layer that lives behind a git tag, so you can check the agent out as it stood at the end of any chapter.
Three Field Notes chapters run the same chaos day against the agent at three milestones: after the architecture is complete (ch06), after it's proven and observable (ch09), and after the shipped rollout (ch12).
This book builds one specific agent: a Site Reliability Engineering agent that diagnoses incidents in a synthetic microservice environment. That agent is a vehicle, not the destination.
The destination is everything the chapters teach you while building it: orchestration, durable state, defensive tools, eval harnesses, deploy gates, observability, cost discipline, guardrails, and rollout policy. These transfer to whatever agent you're shipping next, whether it answers support tickets, drafts contracts, or runs your build pipeline.
You'll learn production agent engineering by building a production agent in the open.
Built with: Python · Anthropic Claude · OpenTelemetry · Postgres · Redis · Prometheus · Grafana · Loki · Tempo · Docker
Optional offline mode runs every chapter without an API key.
No SRE background required. The book treats SRE as the worked example, not as a prerequisite.
PDF · Companion repo on GitHub.
docker compose. A 16 GB laptop is enough.
The next ten miles between your agent demo and a system you'd trust on call. That's the book.
Instant download · PDF