On-call burnout does not send a calendar invite. It sends a resignation letter.



It rarely announces itself. There is no dramatic conversation, no ultimatum, no visible breaking point. What I tend to observe is quieter than that. The senior engineer who stops volunteering opinions in postmortems. The person who used to push back on technical decisions and has stopped. The one who is still showing up, still delivering, but something has shifted. And then, three weeks later, the resignation comes, and in the exit conversation, when someone finally asks the honest question, the answer involves some version of: I was tired of being the person production called at midnight.

On-call burnout does not show up in dashboards. It does not appear in sprint velocity or quarterly OKRs. It accumulates in people, quietly, over rotations that ask too much of too few, until the person who knows the system best decides that knowing the system is no longer worth what it costs them.


The numbers behind what feels like a people problem

It is not a coincidence. It is a pattern.

42% of engineers who leave their roles cite on-call burden as a primary driver. Not salary. Not career growth. The pager. Nearly half of attrition risk in an engineering team sits directly inside a solvable systems problem, and most startups are treating it as an inevitable cost of running production rather than a design failure worth addressing.

65% of engineers reported experiencing burnout in the past year, according to the 2024 State of Engineering Management Report, with on-call stress identified as a major contributing factor. LeadDev's Engineering Leadership Report 2025 found that 22% of engineering leaders are at critical burnout levels, with 38% working longer hours than the year before. These are not edge cases. This is the baseline experience of engineering teams right now, and on-call load is one of the most consistent accelerants.

The day after a 2am page costs more than the incident itself

What rarely gets quantified is the productivity loss that sits around every overnight incident. On-call engineers operate at roughly 80% capacity the day after a page. That 20% reduction compounds across the entire rotation, across every engineer who carried the pager through a bad week. It does not appear on any dashboard. It shows up as slower decisions, missed signals, code reviews that take longer than they should, and a team running slightly below its own capacity for reasons that are invisible from the outside.
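That compounding is easy to make concrete with a back-of-envelope sketch. The 80%-capacity figure comes from the text; the three-pages-a-week scenario below is a hypothetical for illustration:

```python
def lost_days_per_week(overnight_pages: int, day_after_capacity: float = 0.8) -> float:
    """Engineer-days of output lost per week to post-page recovery.

    Each overnight page costs (1 - day_after_capacity) of the paged
    engineer's next working day.
    """
    return overnight_pages * (1.0 - day_after_capacity)

# Three overnight pages in one week: roughly 0.6 engineer-days gone,
# before counting the time spent on the incidents themselves.
print(f"{lost_days_per_week(3):.1f}")
```

On a five- or six-person team, that loss lands on whoever carried the pager, which is exactly why it never shows up as a team-level metric.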

When the engineer eventually leaves, the cost becomes visible in a different form. The average attrition rate in the tech industry runs between 13% and 21% annually. Replacing a senior engineer means months of recruiting, months of onboarding, and a gap period where the tribal knowledge that person carried, the understanding of why that service behaves that way at that specific traffic level, simply does not exist in the team. That knowledge does not transfer in a handover document. It was earned through exactly the kind of midnight incidents that drove them out.


The specific dynamic that makes small engineering teams more vulnerable

When the rotation is too small, load concentrates on the people who know the most

In a startup with five or six engineers, on-call does not distribute evenly in practice, even when the rotation schedule says it does. The alerts that require real judgment, the ones that cannot be resolved by following a runbook, consistently find their way to the most experienced person. The junior engineers escalate. The senior resolves. Week after week, the rotation looks balanced on paper and feels entirely unbalanced in reality.

When fewer than five engineers share 24/7 coverage, each person gets paged far more often than is sustainable. The average on-call engineer already receives more than 30 alerts per shift, with up to 67% requiring no action at all. That noise alone is exhausting. When that noise sits on a rotation where the same two or three people bear the real cognitive load of every serious incident, the timeline to burnout shortens considerably.

The self-reinforcing cycle nobody plans for

Here is the pattern I observe most often. A senior engineer burns out and leaves. The rotation shrinks. The remaining engineers absorb more shifts. Higher load accelerates the same cycle in whoever is left. The team loses another experienced person. The rotation shrinks again. What started as a solvable systems problem becomes a retention crisis that is significantly harder to reverse.

The instinct at that point is usually to hire. But hiring into a team where the on-call experience is already known to have driven people out is a harder conversation than fixing the system that drove them out in the first place.

On-call burnout does not cost you an engineer. It costs you the engineer who understood the system well enough to fix it at 2am, and whose replacement will spend their first six months learning what that person knew on day one.


What a fair on-call system actually protects

Fairness is an engineering problem, not a culture problem

The conversation about on-call fairness tends to get framed as a cultural or management issue, which puts the solution in the wrong category. Fairness in on-call is a systems design problem. It is about whether the load is actually distributed across the rotation or whether it concentrates on the people with the most context. It is about whether the escalation path is automatic or whether it depends on someone noticing that the first alert went unanswered. It is about whether noise gets filtered before it reaches a human or whether every threshold breach at 3am wakes someone up.

Teams where on-call is genuinely sustainable share a few consistent characteristics. The rotation is wide enough that each person is primary for no more than one week in four. Alerts are regularly audited and anything that has not required human action in 90 days is reviewed or removed. Escalation paths are automatic, not dependent on a person deciding to forward a message. And the cognitive load of a shift is bounded: the system handles the known failures, and humans are reserved for the genuinely unknown ones.
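The 90-day audit rule is the most mechanical of these and can be sketched directly. The alert names and the data shape below are hypothetical; a real audit would query the paging system's alert history rather than an in-memory list:

```python
from datetime import datetime, timedelta

# Hypothetical alert records. "last_action" is the last time a human
# actually had to act on this alert (None = never, in the data we have).
alerts = [
    {"name": "disk-usage-warn",   "last_action": datetime(2024, 1, 5)},
    {"name": "api-5xx-spike",     "last_action": datetime(2024, 11, 20)},
    {"name": "cpu-threshold-3am", "last_action": None},
]

def stale_alerts(alerts, now, window_days=90):
    """Alerts with no human action inside the review window."""
    cutoff = now - timedelta(days=window_days)
    return [a["name"] for a in alerts
            if a["last_action"] is None or a["last_action"] < cutoff]

# Everything this returns is a candidate for review or removal.
print(stale_alerts(alerts, now=datetime(2024, 12, 1)))
```

Run on a schedule, a check like this turns "we should prune our alerts" from a standing intention into a recurring list with names on it.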

The connection between alert load and who stays

Almost universally, when I look at what separates engineering teams with low on-call attrition from teams that cycle through people, the difference is not team size or salary. It is whether the alert load is manageable and whether the escalation system is designed well enough that one person is never the single point of failure for production at midnight.

An engineer who knows that a missed page will escalate automatically rather than fail silently experiences on-call differently from an engineer who knows that a missed message means nothing happens until a customer notices. The first is a rotation someone can sustain. The second is a rotation someone tolerates until they find a reason to stop.


Where the alerting system fits into this

Alert noise is not an aesthetic problem

When up to 67% of alerts require no action, the engineers receiving them stop trusting the system. The phone buzzes at 2am, and the engineer does not know, before looking, whether this is the one that matters or one of the twelve that did not. That uncertainty is its own form of exhaustion. It does not feel as dramatic as a three-hour outage, but it accumulates in the same direction.

An alerting system that routes by severity, suppresses known noise, and escalates automatically when there is no acknowledgement changes the on-call experience in a concrete way. The engineer receives fewer alerts. The ones they receive are the ones that require them. And when they do not respond, the system does not fail silently. It finds the next person in the chain, on a timer, without requiring anyone to notice the gap.
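The escalate-on-no-acknowledgement mechanism can be sketched as a walk down a chain. Names here are hypothetical, and a real paging service would drive this from persistent timers rather than an in-process loop, but the logic is the same:

```python
ESCALATION_CHAIN = ["primary", "secondary", "engineering-manager"]

def escalate(alert, acks, chain=ESCALATION_CHAIN):
    """Page down the chain until someone acknowledges.

    `acks` stands in for the paging system's ack state: the set of
    responders who acknowledged within their timeout window.
    """
    paged = []
    for responder in chain:
        paged.append(responder)   # page this person and start their ack timer
        if responder in acks:     # acknowledged before the timer expired
            return responder, paged
    return None, paged            # chain exhausted: a hard failure, never a silent one

# Primary sleeps through the page; secondary picks it up on the timer.
owner, paged = escalate("checkout-5xx", acks={"secondary"})
```

The important property is the last line of the function: when the whole chain is exhausted, the system reports a hard failure instead of letting the alert evaporate.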

This is what we built Shankh to address. Not the monitoring. Not the runbooks. The specific gap between an alert firing and a human reliably owning it, with automatic escalation that distributes load fairly and does not let any single engineer become the permanent answer to midnight. The teams that get this right do not just resolve incidents faster. They keep the engineers who know how to resolve them.


The senior engineer who updated their LinkedIn last month was not lost to a competing offer. They were lost to a system that made midnight their personal responsibility one too many times. That is a recoverable problem, but only if it gets treated as a systems design question rather than a retention mystery. The on-call experience is one of the most controllable variables in whether your best engineers stay. Most startups are not controlling it at all.

Ayesha Siddiqua

The resignation conversations I have heard about most often in growing startups are not about salary or title. They are about exhaustion that built up slowly over months of being the person production called when something broke. At Frigga Cloud Labs, we work with engineering teams to make on-call sustainable before it becomes a retention problem. If your rotation is already concentrating on two or three people, that conversation is worth having now.

Let's connect on LinkedIn

