SRE vs DevOps: what a 15-person startup actually needs from each, and what it does not need yet.

The question of whether a startup needs SRE comes up a lot in engineering conversations, usually framed as a choice between two competing approaches. It is not a choice. DevOps and SRE are not alternatives. DevOps is a cultural and operational philosophy focused on delivery velocity. SRE is a set of engineering practices focused on measurable reliability. A team can and should use both, but the practices that make sense at 15 engineers are different from the ones that make sense at 150.

Working across AWS, GCP, and Azure with GitHub Actions as the deployment layer means thinking about both sides of this daily: how to keep deployments moving fast and how to keep systems stable under real traffic. The honest answer to "does your startup need SRE" is: some of it, right now. The rest later, when the complexity justifies the overhead.


What the two disciplines actually mean


DevOps

DevOps is the practice of breaking down the wall between development and operations. Its primary outputs are CI/CD pipelines, Infrastructure as Code, automated testing, and shared ownership of production. The metrics it optimises for are deployment frequency, lead time for changes, change failure rate, and time to restore service: the four DORA metrics. A DevOps-mature team deploys frequently, safely, and with a fast feedback loop from production.

SRE

SRE was invented at Google in 2003 as a specific implementation of DevOps principles with engineering rigour. Where DevOps says "automate everything," SRE asks "how much downtime can we afford while still moving fast?" The core instruments are Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. SRE also introduces a formal concept of toil: manual, repetitive operational work that scales linearly with system size and should be automated. Google's SRE book recommends that engineers spend no more than 50% of their time on toil. Above that threshold, toil reduction becomes the priority.

The key distinction is measurement. DevOps gives you fast delivery. SRE gives you quantified reliability targets and a mechanism for deciding when reliability work should take priority over feature work. That mechanism is the error budget.


SLIs, SLOs, and error budgets: what they actually mean in practice

SLI: the metric you measure from the user's perspective

An SLI is not an infrastructure metric. CPU at 80% is not an SLI. An SLI is a measurement of what users actually experience. For an HTTP API, the two most common SLIs are availability (the ratio of successful requests to total requests) and latency (the ratio of requests completing faster than a threshold to total requests).

# Prometheus recording rule: availability SLI
# Ratio of non-5xx responses to total requests over 5-minute window
- record: job:sli_availability:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{status!~"5..", job="api"}[5m]))
    /
    sum(rate(http_requests_total{job="api"}[5m]))

# Latency SLI: ratio of requests under 300ms threshold
- record: job:sli_latency:ratio_rate5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{
      job="api", le="0.3"
    }[5m]))
    /
    sum(rate(http_request_duration_seconds_count{job="api"}[5m]))
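Outside of Prometheus, the same ratios are just arithmetic on counters. A minimal Python sketch of the two SLIs (the function names and request counts here are illustrative, not part of the rules above):

```python
def availability_sli(successful: int, total: int) -> float:
    """Ratio of successful (non-5xx) requests to total requests."""
    return successful / total if total else 1.0  # no traffic means no failures

def latency_sli(under_threshold: int, total: int) -> float:
    """Ratio of requests completing under the latency threshold (e.g. 300ms)."""
    return under_threshold / total if total else 1.0

# 1,000,000 requests in the window, 800 of them 5xx:
print(availability_sli(1_000_000 - 800, 1_000_000))  # 0.9992
```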

SLO: the target you set for that metric

An SLO is a specific, measurable target for an SLI over a defined window. "99.9% of requests succeed over a rolling 30-day window" is an SLO. "The system should be reliable" is not. The target needs to be set based on actual historical performance and what the business can genuinely commit to. Starting at 99.9% availability and 99% latency for a new service is reasonable. These can be adjusted after a few months of real data.

What different availability targets mean in practice:

# SLO target   | Error budget (30-day window) | Equivalent to
# 99%          | 7.2 hours                    | One bad afternoon
# 99.5%        | 3.6 hours                    | A couple of incidents
# 99.9%        | 43.2 minutes                 | One short outage
# 99.95%       | 21.6 minutes                 | Half an incident
# 99.99%       | 4.3 minutes                  | Almost no room at all

For most startups, 99.9% availability is the right starting point. Chasing 99.99% at 15 engineers is over-engineering reliability at the cost of velocity. The error budget from 99.9% gives the team 43 minutes of downtime per month before the SLO is breached. That is enough room to move fast without being reckless.

Error budget: the math that connects the two

The error budget is 100% minus the SLO target. If the SLO is 99.9%, the error budget is 0.1%, which translates to roughly 43 minutes of allowed downtime per month. When that budget is plentiful, the team ships features. When it is running low, the team prioritises reliability work. This is the mechanism that makes the feature velocity versus stability trade-off explicit and data-driven rather than a recurring argument between engineering and product.
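The arithmetic is simple enough to sanity-check in a few lines of Python; a sketch over the same 30-day window (the helper name is mine, not a standard library function):

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Allowed downtime, in minutes, for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):6.1f} min per 30 days")
# A 99.9% SLO works out to 43.2 minutes: the roughly 43 minutes quoted above
```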


Alerting on burn rate, not threshold

The most important practical shift SRE brings to alerting is moving from threshold-based alerts to burn-rate-based alerts. A threshold alert fires when the error rate crosses a fixed value, say 5%, regardless of whether that matters: 5% errors during a low-traffic period at 2am is very different from 5% errors during peak traffic on a Friday afternoon. A burn-rate alert fires when the error budget is being consumed faster than is sustainable, which is the question that actually matters.

Burn rate is how fast the error budget is being consumed relative to the SLO window. A burn rate of 1 means the budget will be exactly exhausted by the end of the 30-day window. A burn rate of 14.4 means the entire monthly budget will be gone in 2 days. The Google SRE Workbook recommends a two-window, two-burn-rate alerting approach: a fast burn alert for urgent pages and a slow burn alert for tickets:

groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: page the on-call engineer immediately
      # 14.4x burn rate over 1h AND 6x over 6h
      # Budget exhausted in < 2 days
      - alert: HighErrorBudgetBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5..", job="api"}[1h]))
              /
              sum(rate(http_requests_total{job="api"}[1h]))
            )
          ) / (1 - 0.999) > 14.4
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5..", job="api"}[6h]))
              /
              sum(rate(http_requests_total{job="api"}[6h]))
            )
          ) / (1 - 0.999) > 6
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API burning error budget 14x faster than sustainable"
          description: "At current rate, monthly error budget exhausted in < 2 days"

      # Slow burn: open a ticket, fix during business hours
      # 3x burn rate over 1h AND 1x over 3 days
      - alert: MediumErrorBudgetBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5..", job="api"}[1h]))
              /
              sum(rate(http_requests_total{job="api"}[1h]))
            )
          ) / (1 - 0.999) > 3
          and
          (
            1 - (
              sum(rate(http_requests_total{status!~"5..", job="api"}[3d]))
              /
              sum(rate(http_requests_total{job="api"}[3d]))
            )
          ) / (1 - 0.999) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API slowly burning error budget"

The two-window approach dramatically reduces false positives. A single 5-minute spike in errors triggers neither alert because it barely moves the 6-hour or 3-day averages. A sustained degradation that would exhaust the budget shows up in both windows and pages immediately. This is the shift from reactive alerting to proactive budget management.
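The thresholds in the rules above come straight from the burn-rate arithmetic. A short Python sketch of burn rate and time-to-exhaustion (the 1.44% error ratio is an invented input, chosen to land on the 14.4x threshold):

```python
def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
    """How many times faster than sustainable the budget is burning.
    A rate of 1.0 means the budget lasts exactly the full window."""
    return error_ratio / (1 - slo)

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the budget is gone at a constant burn rate."""
    return window_days / rate

rate = burn_rate(0.0144)        # a sustained 1.44% error ratio -> 14.4x
print(days_to_exhaustion(rate)) # ~2 days: the fast-burn page threshold
```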


Toil: the SRE concept most useful for small teams

Toil is manual, repetitive operational work that scales linearly with system size and has no lasting value. Restarting a service that crashes every night. Manually rotating a secret that should be automated. Running a deployment script by hand that should be in a pipeline. Responding to the same alert that has been firing for six months because nobody fixed the root cause.

Google's SRE book defines 50% as the threshold: if engineers are spending more than half their time on toil, toil reduction becomes the explicit engineering priority. For a 15-person startup this matters more than the SLO framework. Toil at small scale compounds quickly. Every hour spent on manual operational work is an hour not spent on product engineering or reliability improvement. Measuring toil is straightforward: track time spent on reactive operational tasks over a two-week period. If it exceeds 30% consistently, the team has a toil problem worth addressing systematically.
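Measuring it really is as simple as the paragraph suggests; a sketch of the two-week tally in Python (the categories and hours are hypothetical):

```python
# Two weeks of tracked engineering time, in hours (hypothetical numbers)
time_log = {
    "feature work":        210,
    "restarting services":  18,
    "manual deploys":       25,
    "recurring alerts":     30,
    "secret rotation":       7,
}
toil_categories = {"restarting services", "manual deploys",
                   "recurring alerts", "secret rotation"}

toil_hours = sum(h for task, h in time_log.items() if task in toil_categories)
toil_share = toil_hours / sum(time_log.values())
print(f"toil share: {toil_share:.0%}")  # 28% here: close to the 30% line
```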

The fix is not discipline. It is automation. Runbooks get automated. Recurring alerts get root-caused and fixed. Manual deployment steps get added to the pipeline. Each automation permanently removes a category of toil rather than just resolving the individual instance.


What a 15-person startup should actually do

Do now: SLOs for the one or two services that matter most

A startup does not need SLOs for every service. It needs an SLO for the service whose failure costs the most. For a SaaS product, that is typically the authentication flow and the core product API. Define availability and latency SLIs for those two services, set a 99.9% availability SLO, and track the error budget on a Grafana dashboard. The burn rate alerts above can be running in under half a day. This is the SRE practice that delivers the most value at the smallest team size.

Do now: Blameless postmortems for every P0 incident

A blameless postmortem is a structured analysis of what happened during an incident, focused on the system failures that allowed it to occur rather than the person who made the change. The output is a set of action items that make the system more resilient to the same failure mode. This requires no tooling and no dedicated SRE role. It requires a template and the discipline to complete it after every significant incident. A 15-person team that does this consistently will ship a more reliable system than a 50-person team that skips it.

Do now: Measure and reduce toil systematically

Track operational time for two weeks. Identify the top three sources of recurring manual work. Automate one per sprint. This is SRE practice at startup scale without any of the organisational overhead.

Do later: Dedicated SRE role or formal error budget policy

A dedicated SRE role, separate from the engineers building the product, makes sense when the operational load is large enough that it would otherwise consume more than half of a product engineer's time. For most startups, that threshold is somewhere between 30 and 50 engineers running production services with real uptime SLAs to customers. Before that point, SRE practices embedded in the existing engineering culture deliver the same outcomes without the organisational complexity of a separate function.

A formal error budget policy, where the product organisation agrees to halt feature work when the error budget is exhausted, requires the kind of cross-functional alignment that takes time to build. It is worth pursuing, but it is more of a month-six conversation than a month-one prerequisite.


The framing of SRE versus DevOps is a false choice. DevOps gives the team the delivery infrastructure: CI/CD pipelines, Infrastructure as Code, automated testing, shared ownership of production. SRE gives the team a measurement framework for reliability and a principled way to decide when reliability work should take precedence over feature work. A 15-person startup needs both. What it does not need is the full organisational apparatus of a dedicated SRE team, formal error budget enforcement processes, and separate reliability roadmaps. Those come later, when the system complexity and the uptime expectations from customers justify the investment. What the team needs now is SLOs on the critical path, burn rate alerting instead of threshold alerting, blameless postmortems, and a systematic approach to reducing toil. All of that is achievable without adding headcount.

Author note

Mohan Gopi

Associate DevOps Engineer at Frigga Cloud Labs. Works across AWS, GCP, and Azure with GitHub Actions as the deployment backbone. This blog comes from working on both sides of the DevOps and SRE boundary daily: keeping deployment pipelines fast while also building the reliability measurement layer that makes it safe to deploy frequently. The practical question of which SRE practices actually make sense at small team scale is one that comes up constantly, and the answer is more nuanced than most blog posts on the topic acknowledge.

Let's connect on LinkedIn → Mohan Gopi
