On November 18, 2025, a configuration file error inside Cloudflare's Bot Management system caused a four-hour global outage. Spotify went down. ChatGPT went down. Discord, Uber, Canva, Coinbase, and over 22 major platforms reported widespread failures. Even Downdetector, the site people go to check if something is down, was intermittently unreachable because it also runs on Cloudflare.
The cause, when Cloudflare published its post-incident analysis, was surprisingly mundane: a database permissions change caused a query to return duplicate rows, which doubled the size of a configuration file, which exceeded a hard-coded limit in their proxy system, which crashed. One change. No one caught it. Four hours of global disruption.
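The failure pattern is worth seeing in miniature. Below is a hypothetical sketch of the shape of that bug, not Cloudflare's actual code: a loader that enforces a hard-coded capacity and treats an oversized file as fatal instead of falling back to a known-good configuration. The names, the limit, and the data layout are invented for illustration.

```python
# Hypothetical sketch of the failure shape, not Cloudflare's actual code.
# The limit, names, and data layout are invented for illustration.
MAX_FEATURES = 200  # hard-coded capacity, sized for the "normal" file


def load_bot_management_features(rows):
    """Build the feature list the proxy reloads on every refresh."""
    # A permissions change makes the source query return duplicate rows,
    # so the generated file is suddenly twice its usual size.
    features = [row["feature_name"] for row in rows]

    if len(features) > MAX_FEATURES:
        # Treating this as unrecoverable is the fatal step: the process
        # aborts, and traffic through the proxy fails with it.
        raise RuntimeError(
            f"feature file too large: {len(features)} > {MAX_FEATURES}"
        )
    return features
```

The safer pattern is to validate the generated file and keep serving the last known-good configuration rather than crashing on the new one.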
Estimates of the aggregate economic impact from that single incident ran to approximately $1.4 billion across all affected sectors. Cloudflare's own stock dropped over 5% in premarket trading. For the thousands of smaller businesses whose products ran on platforms behind Cloudflare, there was no headline, no reimbursement, and no estimate of what they lost. Just an outage page, silence, and customers who moved on.
What a four-hour outage actually costs a startup
Most founders think about outage cost in the simplest terms: revenue per hour, multiplied by hours down. That number is real, but it is also the smallest part of the actual bill. The rest does not show up on any invoice. Here is what actually accumulates.
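For scale, here is that "simplest terms" calculation as a worked example; every figure is made up for illustration.

```python
# The naive, invoice-visible version of outage cost.
# All figures are invented for illustration.
monthly_recurring_revenue = 250_000  # USD
hours_in_month = 30 * 24

revenue_per_hour = monthly_recurring_revenue / hours_in_month  # ~$347
outage_hours = 4

direct_revenue_loss = revenue_per_hour * outage_hours
print(f"Naive outage cost: ${direct_revenue_loss:,.0f}")  # ~$1,389
```

A four-figure number like that is exactly why waiting feels rational; the costs that follow in this section are the ones the naive math leaves out.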
The Heroku incident no startup should forget

In June 2025, Heroku suffered what was arguably the largest outage in its history. An unplanned Linux OS update ran in production and triggered a systemd restart, which restarted the networking daemon, which took down network connectivity across Heroku's entire infrastructure. For nearly 24 hours, thousands of applications hosted on Heroku went offline. Dashboards were unreachable. CLI tools stopped working. Heroku's own status page, which runs on the same infrastructure, also failed, leaving engineers with no official communication channel and no visibility into progress.

The root cause, when it emerged, was a single automated background process running without any pre-deployment validation or staged rollout. No canary. No alert that would have caught a networking daemon restart. No monitoring that surfaced the failure before customers did. One week later, Heroku had a second incident — eight and a half hours of dyno formation and autoscaling failures. Two incidents, within seven days, each lasting the better part of a business day.

For the startups running production on Heroku, this was not a vendor problem they could abstract away. It was their outage, even though the failure was not theirs. Their customers did not read Heroku's incident report. They just saw that the product did not work. This is the thing that often gets missed in outage conversations: when your infrastructure provider goes down, your customers hold your brand responsible. They do not distinguish between your code and your vendor's code. The incident is yours.

Why this matters at ten engineers, not forty

The counterargument I hear from early-stage teams is consistent: we do not have the resources to build a serious on-call process or a proper observability stack right now. We will do it when we are bigger. The math does not support waiting.

New Relic's observability research found that organizations using proper observability experienced 73% less annual downtime than those without it — 118 hours of downtime per year versus 445 hours. They also spent 28% less engineering time on incident response and 19% less on hourly outage costs. These are not enterprise numbers. These are outcomes available to any team willing to instrument their systems before the first major incident, not after.

At ten engineers, you do not need a 24/7 dedicated on-call rotation with escalation tiers. You need three things that take less time to set up than you think they do.

First, basic observability: error rates, latency, and service health visible to anyone on the team, not just whoever set up the dashboards. If you are running Grafana Cloud on the free tier, this costs nothing and takes an afternoon. If something breaks, someone should see it in a dashboard before a customer emails about it.

Second, alerting that pages the right person. Not an email to a shared inbox. A PagerDuty or OpsGenie alert that goes to a specific person's phone when a threshold is crossed. The question "who is on-call this week" should have a one-word answer at any point in time, even at a ten-person company. One person. Not the CEO. Not whoever happens to be awake. A named engineer with a phone. (A minimal sketch of these first two pieces follows at the end of this section.)

Third, a post-mortem culture from the very first incident. Not to assign blame. To capture what the monitoring missed, what would have caught it earlier, and what changes in the system would reduce the probability of recurrence. Atlassian's incident research consistently shows that teams with strong post-mortem practices resolve incidents faster and reduce repeat incidents. The practice is free. The absence of it compounds.
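Here is a minimal sketch of those first two pieces in one place: a metric Grafana can chart, and a function that pages whoever is on call. It assumes the prometheus_client and requests packages; the service name and routing key are placeholders, and in a real setup you would let Grafana Cloud or Alertmanager evaluate the threshold and route to PagerDuty rather than calling the Events API from application code.

```python
# Minimal sketch, not a production setup. Assumes `pip install prometheus_client requests`.
import requests
from prometheus_client import Counter, start_http_server

# Exposed at http://localhost:8000/metrics for Prometheus or Grafana Cloud to scrape.
REQUESTS = Counter("http_requests_total", "HTTP requests by status", ["status"])

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
PAGERDUTY_ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder from a PagerDuty service integration


def record_request(status_code: int) -> None:
    """Call this from the request handler; a dashboard needs nothing more to chart error rates."""
    REQUESTS.labels(status=str(status_code)).inc()


def page_on_call(summary: str) -> None:
    """Trigger a PagerDuty incident so a named engineer's phone rings, not a shared inbox."""
    requests.post(
        PAGERDUTY_EVENTS_URL,
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "checkout-api",  # hypothetical service name
                "severity": "critical",
            },
        },
        timeout=5,
    )


if __name__ == "__main__":
    start_http_server(8000)  # serve /metrics; an alert rule on the error rate does the paging
```

The point is not these specific tools. It is that the path from "error rate crossed a threshold" to "a specific person's phone made a noise" should be short, explicit, and testable on a calm afternoon.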
The cost that shows up last and hurts most

There is one outage cost that does not fit neatly into any of the categories above. It is the burnout that accumulates in the engineers who get woken up repeatedly at 2am because there are no proper alerts, no runbooks, and no documented recovery process. They fix the problem in the moment, and then they fix it again next month when it happens differently, and then they start updating their resumes. Engineer attrition from repeated, unstructured on-call incidents is well documented in SRE literature and almost never factored into outage cost calculations.

UptimeRobot's 2025 analysis of hidden downtime costs identifies internal velocity degradation as one of the lasting aftereffects of outages: teams add manual checks, redundant approval processes, and conservative deployment policies in response to incidents, and those responses slow shipping for months after the original incident is resolved. The engineers who stay after a string of painful incidents build scar tissue. The ones who leave take the context with them. Both outcomes are expensive, and neither appears in the post-mortem document.

The honest argument for observability and on-call processes at ten engineers is not that you will definitely have a major outage soon. It is that the cost of having a major outage without those systems in place is disproportionately high relative to the cost of setting them up before you need them. Grafana Cloud is free. PagerDuty has a free tier. A runbook is a shared document. The infrastructure to catch and manage incidents properly is available to any team willing to spend the time building it. The question is whether you do it on a calm Tuesday afternoon or on a Saturday night when everything is already on fire.

