The on-call playbook for startup engineering teams. Because most teams build one after the incident that made it unavoidable.


There is a version of this conversation that happens in almost every startup at a predictable moment. Production breaks in the middle of the night. The team scrambles to figure out who is responsible, who should be contacted, and what the recovery process actually is. Nobody has a clear answer to any of those questions because nobody has written them down. The incident gets resolved eventually, usually through a combination of tribal knowledge and someone willing to stay up until 3am. The postmortem identifies the technical root cause. It rarely identifies the organisational root cause: that the team had no on-call process, and this was always going to happen.

What is consistent across the startups I work with through Frigga Cloud Labs is that the teams that build an on-call process proactively almost never look back and wish they had waited longer. The teams that build it reactively, in the aftermath of a painful incident, always wish they had done it earlier. The gap between those two experiences is not the technical complexity of setting it up. It is whether someone decided to treat it as a deliberate process rather than something that would sort itself out.


Why most startup teams do not have an on-call process until they need one badly

It feels like something bigger companies need

The most common version of the objection I hear is some variation of: we are too small for a formal on-call rotation. We all watch the system. If something breaks, one of us will see it. This logic holds right up until the moment it does not. Everyone watching is the same as no one watching, because there is no clarity about who is responsible, and responsibility without clarity diffuses into nothing the moment something breaks at an inconvenient time.

Industry data from 2024 shows that 82% of organisations report mean time to resolution over one hour, up from 47% in 2021. Incident volumes increased 16% year over year while average engineering team sizes stayed flat. The problem is getting harder, not easier, without a process to contain it.

The business cost is not just the downtime

What gets underestimated is not the cost of the incident itself. It is the cost of the aftermath. The engineer who got woken up three times in a week and starts updating their LinkedIn. The enterprise prospect who asks about your uptime track record during a sales call three months after the incident. The CTO who spends two days in retrospective conversations instead of architecture work. These costs do not appear in the postmortem document, but they are real, and they accumulate.


How to structure an on-call rotation when the team is small

The four-person team is not too small to have a rotation

A four-person engineering team with a weekly rotation means each person is on-call one week in four. That is thirteen weeks of on-call responsibility per year per person, with three weeks of relief between each stint. That is manageable. What is not manageable is four people implicitly sharing responsibility with no structure, which in practice means the most experienced person gets called every time because they know the system best, and they burn out quietly over several months before anyone understands what is happening.

The rotation does not need to be complicated. One primary on-call person per week. One secondary who serves as escalation if the primary does not respond within a defined window. For a four-person team, those two roles rotate weekly. The schedule should be published at least two weeks in advance so people can plan around it. Atlassian's guidance on on-call best practices is consistent with what works in practice: predictability and transparency matter as much as the rotation model itself. Team members who know exactly when they are on-call can plan their lives around it. Surprise on-call duty is a morale problem regardless of how infrequent it is.
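The mechanics are small enough to sketch in a few lines. A minimal schedule generator, assuming a four-person team and a weekly rotation where the secondary is next week's primary (names and dates are placeholders):

```python
from datetime import date, timedelta

def build_rotation(engineers, start_monday, weeks):
    """Weekly primary/secondary rotation. The secondary is next week's
    primary, so everyone has a week of context before their own stint."""
    schedule = []
    n = len(engineers)
    for week in range(weeks):
        week_start = start_monday + timedelta(weeks=week)
        schedule.append({
            "week_of": week_start.isoformat(),
            "primary": engineers[week % n],
            "secondary": engineers[(week + 1) % n],
        })
    return schedule

# Publish at least two weeks ahead so people can plan around it.
team = ["Asha", "Ben", "Chloe", "Dev"]  # placeholder names
for slot in build_rotation(team, date(2025, 1, 6), 4):
    print(slot["week_of"], "primary:", slot["primary"], "secondary:", slot["secondary"])
```

Pairing the secondary with the following week's primary is one reasonable choice, not the only one; the point is that the assignment is mechanical and published, not negotiated each week.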

The handoff is as important as the rotation itself

One of the most consistent failure patterns in startup on-call processes is that the rotation exists but the handoff does not. The outgoing on-call person finishes their week and the incoming person starts with no context about what happened, what alerts fired, what was silenced and why, and what risks are elevated right now. This means every rotation starts with an information gap that only gets filled when something breaks and the new on-call person has to piece together the context under pressure.

A thirty-minute handoff meeting at the start of each rotation, where the outgoing person walks the incoming person through active incidents, silenced alerts, and any upcoming deployments or changes, eliminates that gap. It does not need to be elaborate. It needs to be consistent.
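Consistency is easier to get with a fixed template the outgoing person fills in before the meeting. A sketch, with field names and example content of my own invention, to be adapted to the team:

```python
# A fixed handoff template: the outgoing on-call person fills this in
# before the handoff meeting, and it becomes the agenda.
HANDOFF_TEMPLATE = """\
On-call handoff - week of {week_of}
Outgoing: {outgoing}   Incoming: {incoming}

Active incidents:  {active_incidents}
Silenced alerts:   {silenced_alerts} (and why each was silenced)
Elevated risks:    {elevated_risks}
Upcoming changes:  {upcoming_changes}
"""

note = HANDOFF_TEMPLATE.format(
    week_of="2025-01-06",
    outgoing="Asha", incoming="Ben",
    active_incidents="none",
    silenced_alerts="disk-usage on staging-db (known noisy)",
    elevated_risks="payments provider migration in progress",
    upcoming_changes="v2.3 deploy scheduled Tuesday",
)
print(note)
```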


Runbooks: what they are and what makes them actually useful

The difference between a runbook and documentation nobody reads

The runbooks that get used during incidents are short, specific, and answer one question: what do I do right now? They are not architecture documents. They are not explanations of how the system works. They are a list of steps, written for someone who is under pressure at an inconvenient hour and needs to know the next action to take, not a lecture on the system's history.

A useful runbook for a specific incident type covers the symptom, the likely causes in order of probability, the specific checks to run, the steps to take for each likely cause, and who to escalate to if those steps do not resolve it. That is it. The format that consistently works is a checklist, not a manual. If someone reads a runbook and thinks "I need more context," the runbook has failed at its job.
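That structure is concrete enough to template. A sketch of the shape, with a hypothetical incident type and illustrative checks and fixes, not a prescription:

```python
# One runbook per incident type: the symptom, likely causes ordered by
# probability, a check and a fix for each, and a named escalation.
runbook = {
    "symptom": "API p99 latency above 2s",
    "causes": [  # most likely first
        {"cause": "connection pool exhausted",
         "check": "pool saturation on the DB dashboard",
         "fix": "restart the worker pool; raise pool size if it recurs"},
        {"cause": "slow query from a recent deploy",
         "check": "long-running queries in pg_stat_activity",
         "fix": "roll back the last deploy"},
    ],
    "escalate_to": "secondary on-call, then Head of Engineering",
}

def render_checklist(rb):
    """Render the runbook as the checklist the on-call person actually reads."""
    lines = [f"SYMPTOM: {rb['symptom']}"]
    for i, c in enumerate(rb["causes"], 1):
        lines.append(f"{i}. Check {c['check']} -> if {c['cause']}: {c['fix']}")
    lines.append(f"Still broken? Escalate to: {rb['escalate_to']}")
    return "\n".join(lines)

print(render_checklist(runbook))
```

The rendered output is the whole document: one symptom, numbered checks in probability order, one escalation line. Anything that does not fit that shape probably belongs in architecture documentation instead.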

Where to start if no runbooks exist yet

The most practical starting point is to identify the five incidents that have actually happened in the last quarter and write a runbook for each one. Not the incidents that might happen. The ones that did happen, where someone had to figure out the resolution under pressure and the knowledge now exists only in that person's head. Those are the highest-value runbooks to capture, because they address real failure modes with known resolution paths.

Runbooks belong somewhere that can be reached during an incident without logging into anything complicated. A shared Notion page, a GitHub wiki, or a Confluence space linked directly from the alert itself. The best practice is to link runbooks directly from alerts so the on-call engineer receives a page and a link to the relevant playbook in the same notification. The distance between the alert and the runbook should be zero clicks.
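With PagerDuty, for example, the Events API v2 accepts a `links` array on the event, so the page and the runbook arrive in the same notification. A sketch of the payload; the routing key and URLs are placeholders:

```python
import json

def build_alert_event(routing_key, summary, source, severity, runbook_url):
    """Build a PagerDuty Events API v2 trigger event. The `links` array
    puts the runbook one tap away from the page itself."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }

event = build_alert_event(
    routing_key="YOUR-INTEGRATION-KEY",  # placeholder
    summary="API p99 latency above 2s",
    source="prod-api",
    severity="critical",
    runbook_url="https://wiki.example.com/runbooks/api-latency",  # placeholder
)
# POST this as JSON to https://events.pagerduty.com/v2/enqueue
print(json.dumps(event, indent=2))
```

Most monitoring stacks have an equivalent convention; Prometheus teams often put a runbook URL in the alert's annotations for the same reason.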

A runbook written the day after an incident, while the resolution is still fresh, is worth ten times the runbook written speculatively before anything has broken. The knowledge is real, the steps are verified, and the next time it happens, someone else can resolve it in a fraction of the time.


Escalation paths: who gets called and in what order

The three-tier model that works at startup scale

A startup does not need a complex escalation framework. It needs answers to three questions that anyone on the team can give without checking anything. Who is on-call right now? What happens if they do not respond in fifteen minutes? Who is the final escalation for a P0 incident?

The model that works: the primary on-call engineer is the first contact for any production alert. If they do not acknowledge within a defined window, typically ten to fifteen minutes, the alert escalates automatically to the secondary. If the secondary does not acknowledge within the same window, the escalation reaches the Head of Engineering or CTO. The escalation to leadership should be rare. If leadership is being paged frequently, the signal is not that the escalation path is working; it is that something upstream is broken: either the alerts are not actionable, the runbooks are inadequate, or the primary on-call engineers need more support.
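The ladder itself is a tiny state machine. A sketch, assuming a fifteen-minute acknowledgement window; every paging tool discussed below implements exactly this, so this is for illustration, not something to build yourself:

```python
ACK_WINDOW_MINUTES = 15

# Ordered ladder: each tier is paged only if the previous one
# fails to acknowledge within the window.
ESCALATION_LADDER = ["primary on-call", "secondary on-call", "Head of Engineering"]

def who_gets_paged(minutes_unacknowledged):
    """Return the tier currently being paged for an unacknowledged alert."""
    tier = minutes_unacknowledged // ACK_WINDOW_MINUTES
    return ESCALATION_LADDER[min(tier, len(ESCALATION_LADDER) - 1)]

for m in (0, 14, 15, 30):
    print(f"{m:>2} min unacked -> {who_gets_paged(m)}")
```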

Severity levels make escalation decisions faster

Not every alert deserves a 3am phone call. Defining severity levels, even simply, removes the decision from the on-call engineer in the moment. A P0 is a customer-facing outage or data integrity issue. It escalates aggressively and immediately. A P1 is a degraded service that customers are experiencing but can still use. It requires acknowledgement and a fix within a defined window. A P2 is an internal issue with no current customer impact. It can wait until business hours. Having these defined in writing means the on-call engineer does not have to make a judgment call about how urgently to escalate at midnight when they are half-asleep and would prefer to handle it quietly themselves.
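Written down, the whole policy fits in one small table. A sketch of the mapping; the response windows are illustrative, not a standard:

```python
from enum import Enum

class Severity(Enum):
    P0 = "customer-facing outage or data integrity issue"
    P1 = "degraded service that customers can still use"
    P2 = "internal issue, no current customer impact"

# What each level means for the person holding the pager.
# Windows here are example values; pick ones your team can actually honour.
POLICY = {
    Severity.P0: {"page_now": True,  "respond": "immediately, any hour"},
    Severity.P1: {"page_now": True,  "respond": "acknowledge now, fix within a defined window"},
    Severity.P2: {"page_now": False, "respond": "next business day"},
}

def should_wake_someone(sev: Severity) -> bool:
    """The 3am question, answered before 3am."""
    return POLICY[sev]["page_now"]
```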


Tools that work at startup scale without enterprise pricing

PagerDuty

PagerDuty remains the most established option for on-call management. The free tier covers up to five users with basic on-call scheduling and escalation policies. The Professional plan starts at $21 per user per month and adds unlimited API calls and SMS alerts. For a four-to-eight person engineering team that wants a proven, well-documented tool with a large ecosystem of integrations, PagerDuty is the safe default. The interface is not the most modern, but it does what it needs to do reliably, and the breadth of monitoring tool integrations means it works with almost any existing stack.

incident.io

incident.io is the better choice for teams that live in Slack and want incident management to feel native to where the team already communicates. Every incident gets its own automatically created Slack channel. Workflows, updates, and postmortems all happen inside Slack without context switching. The Basic plan is free for single-team on-call, with the Team plan starting at $15 per user per month plus a $10 per on-call user add-on. For a startup where Slack is the operating system of the engineering team, incident.io makes the incident experience feel less like using a separate tool and more like a structured version of what the team already does informally.

Better Stack

For very small teams that want monitoring, on-call scheduling, and a status page in one product without managing multiple tools, Better Stack offers a free tier that covers up to ten monitors, a status page, one on-call responder, and Slack and email alerts. It is not as deep as PagerDuty or incident.io for complex incident workflows, but for a team in its first year that needs basic coverage without a procurement conversation, it is a reasonable starting point.


The on-call process a team has reflects how seriously it treats reliability as an operational discipline. The teams that treat it as overhead to be minimised tend to discover its value the hard way. The teams that treat it as a deliberate system, one that distributes load fairly, gives engineers the context they need to respond confidently, and escalates predictably when someone cannot respond, are the ones where senior engineers stay, incidents resolve faster, and the postmortems are actually about the technical root cause rather than about the fact that nobody knew who was responsible. That is a decision that can be made on a quiet afternoon. It becomes much harder to make well in the middle of an incident.

Ayesha Siddiqua

The pattern I observe most often in early-stage engineering teams is not that they ignored on-call, it is that they assumed it would figure itself out. It never does. At Frigga Cloud Labs, we work with growing startups to put these processes in place before the incident that would have made them unavoidable. If your team is at the stage where everyone is responsible and nobody is accountable, that is the conversation worth having now.

Let's connect on LinkedIn
