Most Rollback Plans Fail at 2am: How Elite Teams Turned Recovery Into a Practiced Reflex

Most teams do not have a rollback. They have a hope that they will improvise one when something breaks.

The data on how that goes is stark. In the 2024 DORA State of DevOps report, elite teams restore service after a failed deployment in under an hour, while low performers take somewhere between one month and six months (DEV, 2024). That gap is not about who has the smartest engineers. It is about who decided how to recover before they needed to.

Rollback gets treated as a button you press in a crisis. It is not. It is a capability you build and rehearse in advance, and most teams discover the difference at the worst possible hour. This post is about why so few teams can actually roll back at 2am, and what separates a rollback that is a plan from one that is a panic.

Recovery time is decided before the incident, not during it

Failed deployment recovery time is one of DORA's four key metrics. It measures how long it takes to restore service when a deployment causes an outage (Multitudes, 2025). The important thing about that number is that you do not get to set it during the incident. It is already fixed by choices you made weeks earlier: whether the previous build still exists, whether the change can be reversed at all, and how much you bundled into the release.
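To make the metric concrete, here is a minimal sketch of how you might compute it from your own incident log. The records and function names are hypothetical, and the sketch assumes you already capture when a deploy broke service and when service came back. DORA does not prescribe this exact calculation.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical incident log: (when the bad deploy broke service, when service was restored).
incidents = [
    (datetime(2025, 3, 4, 2, 10), datetime(2025, 3, 4, 2, 45)),
    (datetime(2025, 4, 18, 14, 0), datetime(2025, 4, 18, 14, 20)),
    (datetime(2025, 5, 2, 23, 30), datetime(2025, 5, 3, 3, 30)),
]


def recovery_times(records):
    """Return the time to restore service for each failed deployment."""
    return [restored - failed for failed, restored in records]


def median_recovery(records):
    """Median recovery time; one long outage skews it less than it would a mean."""
    return median(recovery_times(records))


durations = recovery_times(incidents)
print("median recovery:", median_recovery(incidents))
print("worst recovery:", max(durations))
```

Reporting the worst case next to the median is a deliberate choice here. The median describes a typical incident, while the worst case is closer to what a genuinely bad night would look like.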

DORA is explicit that rollback is a normal part of running software, not an embarrassing exception. Its definition of a failed change counts anything that needs a hotfix, a rollback, or a fix forward to recover (Multitudes, 2025). Recovering from a bad deploy is expected. Being unable to is the failure.

A fast recovery time on a dashboard can still mislead, because a team that has only ever had small, easy incidents has not really been tested. The number tells you how you did, not how you will do under a genuinely bad one. It is worth trusting only once you have rehearsed the bad one.

Why most teams cannot actually roll back at 2am

When a team finally reaches for the rollback, it tends to fail in one of four ways. The schema migration went out coupled to the code, and a dropped column or a rewritten table does not reverse the way a redeploy does. The previous build was overwritten or was never versioned, so there is nothing clean to go back to. Configuration and data have drifted since the release, so the old code no longer matches the world it would run in. Or the release was one large batch, so rolling it back undoes a fortnight of unrelated work along with the fix.
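The overwritten-build failure is the easiest of these to rule out mechanically. As a rough sketch, assuming a hypothetical in-memory registry (a real one would be a container registry or artifact store with overwrite disabled), the idea is that publishing never replaces an existing version, so the previous build is always there to look up.

```python
class ImmutableRegistry:
    """Hypothetical artifact registry that refuses to overwrite a published version."""

    def __init__(self):
        self._digests = {}  # version -> artifact digest
        self._order = []    # versions in the order they were published

    def publish(self, version, digest):
        if version in self._digests:
            raise ValueError(f"{version} is already published and cannot be overwritten")
        self._digests[version] = digest
        self._order.append(version)

    def previous(self, version):
        """Return the version published immediately before the given one."""
        index = self._order.index(version)
        if index == 0:
            raise LookupError("there is no earlier version to roll back to")
        return self._order[index - 1]


registry = ImmutableRegistry()
registry.publish("v41", "sha256:aaa")
registry.publish("v42", "sha256:bbb")
print(registry.previous("v42"))
```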

Underneath all of these is the same mistake. The team optimised for shipping and never built the path back. As one engineer's summary of the DORA findings puts it, speed means nothing if you cannot recover quickly when things break (Dibeesh, 2025). A pipeline that can only go forwards is not fast. It is fragile.

Fix forward is sometimes the right call. A one-line configuration flip is often quicker to correct than to revert. The point is that fix forward should be a choice you make because it is better, not the only option you have because rollback was never possible.

Marks and Spencer shows what "we will figure it out" costs

When restoration is not designed in, recovery is not measured in minutes. It is measured in weeks. Marks and Spencer is the clearest recent example. Its April 2025 incident was a cyberattack rather than a failed deploy, so it sits at the extreme end, but the lesson is identical. The retailer suspended online clothing orders for seven weeks, and its first-half underlying profit fell by more than half (Insurance Journal, 2025). Click and collect did not come back for around fifteen weeks (TechRadar, 2025).

The bill was roughly three hundred million pounds in lost operating profit, and at one point the company was reduced to tracking stock with pen and paper (BlackFog, 2025). The lesson M&S itself took to Parliament was blunt: organisations should be able to keep operating manually when their systems are down (TechRadar, 2025). The time to design your fallback is before the outage, not during it.

A ransomware recovery and a deployment rollback are different playbooks, and I am not pretending otherwise. What they share is the failure mode. Both go badly for exactly the same reason, which is that nobody built and tested the way back while there was still time.

What makes rollback a plan

The teams that recover in under an hour are not braver. They built a few unglamorous things in advance. They keep immutable, versioned builds, so the last good version is always sitting there ready to redeploy. They make changes backward compatible, so the old and new versions can both run against the same database and a revert is safe. They keep releases small, so a rollback undoes one change rather than a fortnight of them. They separate the parts that cannot be reversed, like schema changes, from the parts that can, so the deploy itself stays reversible.
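Backward compatibility is the least intuitive item on that list, so here is a small sketch of one common way to get it, often called expand and contract. It uses Python's built-in sqlite3 module with an invented users table. The table, column names, and functions are assumptions for illustration only. The schema change is purely additive and ships on its own, so both the old and the new code can run against the same database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")


def old_code_read(conn):
    """Version N of the app: only knows about the name column."""
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


def new_code_read(conn):
    """Version N+1: prefers display_name and falls back to name for older rows."""
    rows = conn.execute("SELECT COALESCE(display_name, name) FROM users ORDER BY id")
    return [row[0] for row in rows]


def new_code_write(conn, name, display_name):
    # Version N+1 keeps filling the old column, so version N still sees complete rows.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, display_name),
    )


# Expand step: an additive, nullable column, deployed separately from the code change.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Version N+1 is live and writes a row.
new_code_write(conn, "Grace Hopper", "Grace")
print(new_code_read(conn))

# Roll back to version N: the old query still works against the expanded schema.
print(old_code_read(conn))
```

The irreversible part, dropping the name column, is the contract step. In this pattern it waits for a later release, once you are confident you will not need to go back to version N.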

Feature flags belong in this list too. If a risky change ships behind a flag, you can switch it off in seconds without redeploying anything, which is the fastest rollback there is. None of this is exotic. It is the same set of capabilities DORA keeps finding behind elite recovery times (Multitudes, 2025).
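A flag does not need to be sophisticated to give you that off switch. The sketch below assumes a hypothetical in-memory flag store and an invented pricing change. In practice the store would be a config service or a database row that every instance reads at request time.

```python
class FlagStore:
    """Hypothetical in-memory flag store, standing in for a real config service."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name):
        # Unknown flags read as off, so the old behaviour is the default.
        return self._flags.get(name, False)


flags = FlagStore()


def legacy_total(prices_in_pence):
    return sum(prices_in_pence)


def discounted_total(prices_in_pence):
    # The risky new behaviour (invented for this example): 10% off the basket.
    return sum(prices_in_pence) * 90 // 100


def checkout_total(prices_in_pence):
    if flags.is_enabled("new_pricing"):
        return discounted_total(prices_in_pence)
    return legacy_total(prices_in_pence)


basket = [1000, 2500]
flags.set("new_pricing", True)
print(checkout_total(basket))   # new code path

flags.set("new_pricing", False)  # the "rollback": no redeploy involved
print(checkout_total(basket))   # old code path again
```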

These work together or not at all. Immutable builds do not help if every release is a huge batch. Feature flags do not help if the same change also shipped an irreversible migration. The plan is the combination, not any single piece of it.
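One way to keep the combination honest is to check all of it before a release goes out, not one piece at a time. The following is only a sketch: the Release fields, the threshold of ten changes, and the wording of the blockers are assumptions a team would replace with its own.

```python
from dataclasses import dataclass


@dataclass
class Release:
    version: str
    previous_artifact_exists: bool         # is the last good build still deployable?
    includes_irreversible_migration: bool  # e.g. a dropped column shipped with the code
    change_count: int                      # changes bundled into this release
    risky_paths_behind_flags: bool


MAX_CHANGES = 10  # hypothetical team threshold for a "small" release


def rollback_blockers(release, max_changes=MAX_CHANGES):
    """Return the reasons this release could not be rolled back cleanly."""
    blockers = []
    if not release.previous_artifact_exists:
        blockers.append("previous build is not available to redeploy")
    if release.includes_irreversible_migration:
        blockers.append("irreversible migration is bundled with the code change")
    if release.change_count > max_changes:
        blockers.append(f"{release.change_count} changes bundled into one release")
    if not release.risky_paths_behind_flags:
        blockers.append("risky changes are not behind a feature flag")
    return blockers


small = Release("v42", True, False, 3, True)
big = Release("v43", True, True, 40, True)
print(rollback_blockers(small))
print(rollback_blockers(big))
```

The second release has an immutable previous build and flags in place, and it is still not safely reversible, because the batch is large and the migration cannot be undone.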

The rollback you have never tested is not a rollback

Here is the part almost everyone skips. A rollback procedure that has never been run is not a capability, it is a document. The first time you test it should not be during a real outage at 2am, with customers watching and the adrenaline running.

The teams that can actually recover practise it. They roll back in a safe environment on a normal afternoon. They run game days where they deliberately break something and restore it. They confirm the previous build still deploys, that the database tolerates the old code, and that whoever is on call knows the steps without reading a wiki. A recovery time you have measured under practice is real. One you are assuming is a guess (Gitmore, 2026).
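A drill can be as simple as a script that redeploys the previous version and times how long it takes for a health check to pass. The sketch below assumes you can supply two callables, one that deploys a version and one that reports health. The fake environment exists only so the example runs on its own; a real drill would point at a staging environment.

```python
import time


def run_rollback_drill(deploy, is_healthy, previous_version, timeout_s=60.0, poll_s=0.01):
    """Redeploy the previous version and measure how long until the service is healthy."""
    start = time.monotonic()
    deploy(previous_version)
    while time.monotonic() - start < timeout_s:
        if is_healthy():
            return {"recovered": True, "seconds": time.monotonic() - start}
        time.sleep(poll_s)
    return {"recovered": False, "seconds": time.monotonic() - start}


class FakeEnvironment:
    """Stand-in for a staging environment where one version is known to be broken."""

    def __init__(self, broken_version):
        self.broken_version = broken_version
        self.version = broken_version

    def deploy(self, version):
        self.version = version

    def is_healthy(self):
        return self.version != self.broken_version


env = FakeEnvironment(broken_version="v42")
result = run_rollback_drill(env.deploy, env.is_healthy, previous_version="v41")
print("recovered:", result["recovered"], "in", round(result["seconds"], 3), "seconds")
```

Recording the measured seconds from each drill gives you a recovery time that comes from practice instead of assumption.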

You cannot rehearse every scenario, and you do not need to. But you should have run the common one, the bad deploy, end to end at least once. If you have never reverted a release on purpose, you do not yet know whether you can do it under pressure.

The part worth sitting with

So ask the honest question now, not during the next incident. If the deploy you ship tomorrow takes the site down, can you put it back, and have you ever proved it? Most teams cannot answer yes, and they learn that at 2am. The rollback is not the button you reach for in the dark. It is the work you did weeks earlier so that the dark is uneventful. Marks and Spencer is what the other answer looks like at scale: seven weeks of suspended orders and three hundred million pounds, because recovery was never a plan. Your incident will be smaller. It will still take your night, and it will still have been decided in advance, by whether you treated rollback as something to sort out later. Later is 2am. Sort it out now.

Author note

I am Mohan Gopi, an Associate DevOps Engineer at Frigga Cloud Labs, working across AWS, GCP, and Azure with GitHub Actions as my deployment backbone. I wrote this because rollback is the capability teams are most confident about and least prepared for. The pattern I keep seeing is a team that swears it can roll back, until the night it has to, when the migration will not reverse, the old build is gone, and nobody has run the revert in months. Recovery time is not a number you improve during an incident. It is one you earn before.