Recursive DevOps

DevOps was always a feedback loop. The first version of it ran between people. The version worth building now runs the system back on itself, so that every failure and every change feeds forward into a system that is cheaper and safer to change next time. The recursion is the whole compounding mechanism, and most teams never turn it on.

The longer I work inside the DevOps thesis, the more I think the original idea was both completely right and quietly mislabeled. It got remembered as a thesis about automation. It was really a thesis about loops. And once you see it as a thesis about loops, an obvious question appears that the industry mostly did not ask: a loop between what and what, and how many times does it run.

This is the answer I have arrived at. The most valuable thing you can do in operations is not to automate the work. It is to close the loop on the system itself, so the system becomes the thing that improves the system. I have written the broad version of why systems thinking is the part that compounds in DevOps Beyond Automation. This is the narrower, sharper claim underneath it.

The loop was the original idea

The original DevOps argument was simple and correct. Developers and operators were optimizing against each other, the friction between them was the single largest source of incidents, and the cure was to put them in the same loop. Deploy small changes often. Automate the path from commit to production. Measure what matters. Share responsibility for the running system, not just the code that produced it.

Every word of that is still true. But notice the shape of it. The cure was not a tool. The cure was a feedback loop. Put the people who write the change and the people who run the change into one circuit, so that the consequences of a decision reach the person who made it, quickly, while they can still do something about it. That is the actual invention. The pipelines and the dashboards and the alerting were just the wiring that made the loop fast.

When you see the loop as the invention, the tools stop being the point. A loop is a pattern, and a pattern can be applied to more than the thing it was discovered on. The first DevOps loop closed between two groups of humans. The interesting move is to ask what else you can put inside a loop, and what happens when the loop starts feeding into the very thing that produced it.

From a loop between people to a loop on the system

Here is the shift. The original loop ran from the system to a human and back: the system did something, a human saw it, the human changed the system. That is already valuable, and it is most of what teams build. But the human in the middle is a bottleneck, and more to the point, the human keeps having to learn the same lesson because nothing about the system itself changed to remember it.

A recursive loop closes the circuit one level deeper. The system does something, a human sees it, and then the human changes the system so that the system itself now handles that case, or makes it impossible, or surfaces it earlier next time. The output of one pass is a permanent change to the thing that will run the next pass. The loop is no longer just operating the system. It is reshaping it, and each reshape changes what the next reshape starts from.

That is what I mean by recursive DevOps. Not infinite cleverness. A discipline of always closing the loop one level deeper than the incident, so that the work you do this week lowers the cost or the risk of the work you do next week. If your operational effort this month did not make next month's operational effort cheaper, you did maintenance. Maintenance is necessary, but it does not compound, and the whole game in a long-lived system is compounding.

A failure is an input, not the enemy

The cleanest example I have is the day an agent-written deploy script deleted most of production.

The script was good. It built the site, synced the files to a bucket, and cleaned up anything stale so the live state matched the build exactly. That last behavior, deleting whatever was not in the new build, is correct right up until the build comes up short. One day a build failed partway through, produced an incomplete set of files, and the script did exactly what it had been told. It made the live site match the incomplete build, which meant deleting most of the images from production. I have written the longer version of this in how I rebuilt this site in agent mode, but the part that matters here is what I did with the failure.

The non-recursive response would have been to fix the immediate problem. Re-upload the images, clear the cache, move on, and add a line to a runbook telling the next person to be careful. That closes the loop at the human level. The lesson lives in a person, the system is exactly as dangerous as it was, and the failure is waiting to happen again to someone who has not read the runbook.

The recursive response is to change the system so the failure cannot recur. I pushed the images back with a sync that only adds and never deletes, and then I changed the deploy itself so that a bad build can no longer reach a destructive step, and so asset uploads only ever add. The failure became a permanent improvement to the shape of the system. The system now knows something it did not know before, not in a document, in its structure. That is the only good thing a failure is for, and a failure that you spend on a runbook line instead of on the architecture is a failure you have agreed to suffer again.

This is what people mean, or should mean, by antifragile, and it is the same instinct I try to bring to everything I build, the practice of making systems that gain from disorder rather than ones that merely survive it. A robust system takes the hit and stays standing. A recursive system takes the hit and comes out the other side unable to take that particular hit again. The disorder is not the thing you are defending against. It is the input. Every incident is a free piece of information about where your system is weak, paid for in pain you have already spent. The only waste is to spend the pain and not collect the information.

The discipline that makes this routine is naming failure modes before they happen and rehearsing them. The platform itself keeps teaching me to do it: assume the part will fail, decide in advance what happens when it does, and build so that the answer is already known. What happens when a deploy is incomplete. What happens when a cache header is wrong. What happens when a path that worked on the old host does not resolve on the new one. Each of those questions, asked early, is a small recursion run on purpose instead of waiting for the system to run it on you at three in the morning.

Make the runbook unnecessary

There is a tell that separates teams that are looping recursively from teams that are just automating, and it is the runbook.

Automation looks at a nine-page runbook and asks how to execute it faster. It writes the script that runs the steps. This feels like progress, and it produces real, measurable savings, and it is the easy half of the job. The hard half, the recursive half, looks at the same nine-page runbook and asks a different question: why does this need to exist at all? A system that requires fifteen humans to interpret seven dashboards to decide whether it is safe to deploy on a Friday is not a system that needs more automation. It is a system that needs a different shape. Automation built on top of a confused architecture just amplifies the confusion at higher speed.

I have written a lot of runbooks. The work that actually moved the incident rate down was almost never the runbook. It was the work that made the runbook unnecessary. The runbook is the symptom. The architecture that produced it is the disease, and you can spend a career treating symptoms very efficiently while the disease compounds underneath.

So the recursive move on any piece of operational toil is to ask whether this pass eliminates the need for the next pass, or merely speeds it up. Automating a manual step that should not exist is a local optimum that locks in the thing you should have removed. The deeper version of the question, the one that pays off over years, is the one most organizations are bad at, because they are exquisitely good at calculating the cost of doing something and catastrophically bad at calculating the cost of not doing it. The five-year-old instance nobody will touch because nobody knows what runs on it is a cost. The fragile deploy everyone is afraid to refactor is a cost. The migration the team has spent four years not doing is a cost. Automation does not surface those. Only judgment does, and judgment is the part of this that does not get automated, which is exactly why it is the part that keeps mattering.

The platform that improves the platform

Around 2022 the industry started calling the mature version of this platform engineering, and the rename was useful because it made the recursion explicit. The platform is a product. The application teams are the customers. And the platform team's entire mandate, the whole of it, is to make the next line of application code easier and safer to ship than the previous one.

Read that mandate as a function and it is openly recursive. Each thing the platform team ships is supposed to lower the cost of the next thing everyone else ships, which lowers the cost of the thing after that. A platform feature that nobody adopts is not neutral. It is debt, because it added surface area without lowering anyone's cost of change. The platform team's success is not measured by how much they ship. It is measured by how much the application teams ship because the platform exists. That is a derivative, a rate of change of someone else's velocity, and optimizing a derivative is a fundamentally different posture than optimizing your own output.

This is the recursion at the organizational scale. You are not building the product. You are building the thing that builds the product, and then improving the thing that builds the thing, and the payoff lives in how many downstream changes each upstream change makes cheaper. A team that internalizes this stops measuring itself by tickets closed and starts measuring itself by friction removed, which is the only metric that compounds.

Agent mode closes the loop faster

Everything above was true before AI agents entered the workflow. What agent mode changes is the cost of a single pass through the loop, and lowering that cost changes how often you can afford to run the recursion.

A lot of the routine platform work that used to take a junior engineer a week now takes a senior engineer a morning. Writing the Terraform module, scaffolding the service, generating the CI workflow, drafting the runbook, and crucially, performing the reshape that makes the runbook unnecessary: all of that compresses hard. When the mechanical cost of changing the system drops toward zero, the recursion gets cheaper to run, and things that were too expensive to bother fixing become worth fixing. The half-day refactor that removes a class of incident was never worth a week of a person's time against everything else on the board. At a morning, it is obviously worth it. So the loop runs more often, and a loop that runs more often compounds faster.

The economics of the team shift with it. The old question was how many junior engineers you needed to do the rote work. The new question is how many senior engineers you need to direct the rote work the agents now do, and the ratio inverts. A platform team stops being measured by its capacity to produce configuration and starts being measured by the quality of its judgment about which configuration is worth producing. That is a smaller team doing higher-altitude work, and it is only an improvement if the judgment is actually there. Invert the ratio without the systems literacy to aim the agents and you have not made the team better. You have given it a faster way to generate operational debt, because the agents will widen whatever loop you point them at, the compounding one or the destructive one, with equal cheerfulness.

But there is a sharp edge, and it is the same edge I described in Agent Mode Changes the Shape of Thought. The agents are very good at the steps and very bad at whether the steps are the right steps. An agent will happily build a beautifully clean pipeline that solves the wrong problem. It will write a Terraform module that provisions infrastructure your team cannot operate. It will refactor a deploy script in a way that breaks a tribal-knowledge invariant nobody wrote down. This is not a flaw in the agents. It is a clarification of where the engineering always was. The agents have made it impossible to fake the systems-thinking part of the job, because they will produce an enormous volume of confident, plausible work that quietly increases the operational debt if a real judgment is not steering it.

So agent mode does not run the recursion for you. It turns the crank. It makes each pass cheap. The thing that decides whether the loop compounds or just produces more system faster is still a human deciding which failure is worth turning into a permanent improvement, which change actually lowers the cost of the next change, and which part of the system to point the loop at next. The agent makes the reshape cheap. It cannot tell you that the reshape was worth doing, or that you reshaped the right thing.

Pointing the recursion

The practical version of all of this fits on one question, which I now ask of nearly every piece of operational work: does this pass make the next pass cheaper, or does it just get me through today?

If it gets me through today, it is maintenance, and sometimes maintenance is what the day requires. But if I find that almost everything I am doing is getting me through today, I am running an open loop. I am operating the system and learning nothing into it. The recursive habit is to spend a fixed fraction of every incident and every change on closing the loop one level deeper than the problem in front of me. The incident becomes a structural change that makes it impossible. The change becomes a lowering of the cost of the next change. The runbook becomes an architecture that needs no runbook.

The signals that tell you whether the recursion is working are the boring operational numbers, and they are honest in the way that physics is honest. I have written about how AWS is math and Kubernetes is physics, how one discipline bills you in money and the other in latency and contention. Those bills are the feedback. The cost trend, the incident rate, the time it takes a new engineer to ship safely, the length of the runbook: those are the readouts of whether each pass is making the system cheaper to change or more expensive. A recursive practice bends them in the right direction over quarters, not because of any single heroic fix, but because the loop is pointed at making the next thing easier, and it keeps running.

The system that gets stronger from what hits it

DevOps started as a loop between people, and that was the right place to start, because the friction between people was the first bottleneck. The version I am describing is the same idea taken one level deeper: a loop the system runs on itself, with a human pointing it and agents turning the crank fast enough that you can afford to run it constantly. The output of the loop is not a deployment. The output of the loop is a better system for producing the next deployment, and a better system again after that.

The aim, in the end, is a system that gets stronger from what hits it. Not one that resists disorder, but one that converts it, that takes every failure and every awkward change and every painful migration and comes out structurally improved, unable to suffer that exact thing again, cheaper to change than it was the day before. That requires keeping the whole growing thing legible enough to actually reason about, which is the larger idea underneath everything I build, the practice of making a complex system legible enough to live inside, whether the system is a cloud platform, a codebase, or a life. The agents will turn the crank as fast as you let them. Whether the loop compounds is the part that is still, and will remain, yours.

One small, daily instance of this loop is having one agent review another's work, the adversarial second pass I describe in My AI Code Review Workflow: cheap to run, and a place where the recursion earns its keep before any of it ships.

If you want the broader argument for why systems literacy is the skill that survives, it is in DevOps Beyond Automation, and the book-length version of how I work this way is AgentSpek, free to start reading here. The rest of how I build lives at AI-Assisted Development.