A bad change is survivable. Shipping it to everyone at once is what turns it into a catastrophe.

Every change you ship might be wrong. That is not pessimism, it is the base rate of software, and the only real question is how many of your users find out at the same time.

The most expensive way to answer that question is all of them at once. On 19 July 2024, CrowdStrike pushed a configuration update that had passed its internal canary tests straight to every customer at once, with no phased rollout, and within 78 minutes it had crashed 8.5 million Windows machines worldwide (WSFB, 2025; VentureBeat, 2025). The bug was real, but the bug was not the catastrophe. The deployment strategy was.

A change shipped to 100 per cent of users at once is a big-bang deployment, and it means a single fault has a worldwide blast radius. The alternative is to expose the change to a small slice first, watch it, and expand only if it holds. This post is about that choice, why big-bang releases are so tempting and so dangerous, and what progressive delivery actually buys you. No configuration here, just the strategy.

What a big-bang deployment is, and why it keeps happening

A big-bang release ships everything to everyone at once. It is tempting because it is simple, one push and you are done, it feels efficient, and the change passed staging so it should be fine. For decades this was simply the default: infrequent, monolithic releases pushing all the new features and fixes to the entire user base in one go, like dropping a boulder into a pond so the impact reaches the whole surface at the same moment (Amplitude, 2024).

The problem is the failure mode. When the change is good, nobody notices how it was shipped. When it is bad, everyone experiences the problem at once and keeps experiencing it until the update is rolled back (Amplitude, 2024). You are betting the whole system on one change being correct, every single time.

The nuance is that big-bang is not always wrong. For a tiny blast radius, an internal tool or a handful of users, the simplicity is fine and the machinery of a staged rollout would be overkill. The danger scales precisely with how many people one bad change can reach before you can stop it.

CrowdStrike: the bug was small, the blast radius was the world

The technical fault was mundane, a configuration error in a content update. It became a global meltdown because of how it was shipped. The update was pushed globally without the staged rollout that CrowdStrike's regular feature updates went through, because rapid-response content ran on a lighter, faster validation path, and engineers inside the company had already flagged the risk of pushing kernel-level updates this way (CyberThreat Dialogues, 2025).

It was reverted after 78 minutes, but by then 8.5 million machines had crashed, insurers put the loss at 5.4 billion dollars for the top 500 US companies alone, and more than 5,000 flights were cancelled (VentureBeat, 2025). A gradual, phased rollout instead of an immediate global one would have reduced the impact dramatically (CSA, 2025).

The nuance is that it is too easy to say they simply needed a canary. They had canary tests, and the update passed them. The failure was that a fast path skipped the staged production rollout, and as one observer put it afterwards, even strong practices cannot outpace the risk when the same velocity that ships fast also widens the blast radius (VentureBeat, 2025). The lesson is not one missing control. It is treating blast radius as a first-class concern.

The update passed its internal canary tests and was rolled out to everyone at once, with no phased deployment to limit the damage (WSFB, 2025). The bug was the fault. The deployment strategy was the catastrophe.

Progressive delivery: let a small slice fail so everyone else does not

The alternative is to never expose everyone at once. You ship the change to a small fraction of users first, a canary, route perhaps 1 per cent of traffic to it, validate the error rates and latency against the old version, then step up to 5, 25, 50, and 100 per cent, pausing or rolling back at any step that looks wrong (DEV, 2025).

Ring-based rollouts formalise the same idea, defining explicit rings of users that are updated one after another, a technique large organisations like IBM have used for years (Amplitude, 2024). Whatever the shape, the entire point is the same: to limit the blast radius of any single change so that a fault is contained to a small group rather than the whole system (OpsMx, 2025).

The nuance is that a canary only works if you are actually watching it and can stop. Progressive delivery depends on real monitoring, traffic routing, and automated rollback, and without the observability to tell you the canary is sick, you have not prevented your outage, you have only slowed it down (DEV, 2025).

Ship the change to 1 per cent of users, watch the error rates and latency, then step up to 5, 25, and 100, halting the moment something looks wrong (DEV, 2025). The bad change still happens. It just happens to 1 per cent of people instead of all of them.

Containment is not the same as prevention

Progressive delivery limits damage, it does not stop bugs, and it has a second-order failure that CrowdStrike showed clearly. Even after the root cause was fixed, Delta was disrupted for days because its crew-tracking software could not handle the backlog, a failure of automated recovery rather than of the deployment itself (Finster, 2024). The change being contained is not the same as the system being able to recover.

Progressive delivery also has real costs. Rollouts take longer because they are gradual, you can only run one carefully at a time, and a problem at one stage halts the rest until it is resolved, which can make your delivery milestones harder to read (ConfigCat). These are not reasons to avoid it, but they are real, and pretending otherwise is how teams adopt the label without the discipline.

The nuance, and it is the important one, is that the answer to all of this is not to deploy less often. That instinct, slow everything down to feel safe, is exactly backwards. The right response is layered safety, a smaller blast radius, monitoring, fast rollback, and recovery that can absorb a backlog, all at once (Finster, 2024).

How to choose, and what one bad change should cost you

The practical rule is to size your deployment strategy to your blast radius. The strategies form a spectrum: a big-bang push that takes everyone live at once, then rolling and blue-green deployments, and at the careful end a canary that ships to 1 per cent first, each trading a little speed and simplicity for a smaller blast radius (DEV, 2025). The more users or machines one change can break at once, the further toward that careful end you should sit.

The nuance is that the test is simple, and you can run it today. Imagine your next change is broken, and ask how many users or machines it would take down before you noticed and stopped it. If the honest answer is all of them, then your deployment strategy is the risk, not your code, and no amount of extra testing changes that.

The part worth sitting with

So go back to the question you started with: when you ship your next change, and some fraction of changes are always wrong, how many of your users find out at the same time? If the answer is all of them, you have built your reliability on the hope that you never make a mistake, and CrowdStrike is a 5.4 billion dollar reminder of how that ends. The fix is not to stop shipping, and it is not to test until you are certain, because certainty is not on offer. It is to decide, before anything breaks, that no single change gets to reach everyone at once. Ship it to a few, watch, and expand only when it holds. A bad change will still slip through, because they always do. The only thing you get to choose is whether it takes down 1 per cent of your users or all of them, and you choose that in your deployment strategy, long before the bad change exists.

Author note

I am Manjunaathaa, an Associate DevOps Engineer at Frigga Cloud Labs. I work across AWS, GCP, and Azure daily, with GitHub Actions as my deployment backbone. My focus is Proactive Resilience: a deployment is only as safe as the feedback loop watching it, which is why I care more about how fast I can see a change going wrong than about how confident I was that it would not. Every practice in this post is something I actually run in production, not something I read about. I wrote this because the most expensive outages of the last few years were not break-ins, they were self-inflicted pushes to everyone at once. The thing I keep coming back to is that you cannot test your way to certainty, so you design for the bad change instead: a small blast radius, real monitoring on the canary, and a rollback you have actually rehearsed. Ship to a few, watch closely, expand only when it holds. Let's connect on LinkedIn → Manjunaathaa.