Your Infrastructure Isn’t Broken — It’s Unpredictable

Why Hiring DevOps Still Doesn’t Fix Infrastructure

The familiar sequence: infra starts breaking → DevOps is hired → pipelines and monitoring improve → things should stabilize.
But across many teams, that last step never quite arrives.
A Series A SaaS company with ~30 engineers went through this shift recently.
After hiring their first DevOps engineer, deployment time dropped from 30+ minutes to under 10. CI pipelines were cleaner, environments were structured, and visibility improved.
But over the next quarter, internal data showed:
- Around 40% of production incidents still needed manual intervention
- More than 60% of critical issues depended on the same engineer to resolve
From the outside, everything looked “set up properly.”
Inside the team, confidence in the system hadn’t really improved.
This is not unusual.
Across startups in the 20 to 80 engineer range, a consistent pattern is showing up:
Infrastructure is becoming more organized,
but not necessarily more predictable.
Teams are shipping faster,
but still approaching deployments with caution.
Even in broader industry benchmarks like the Accelerate and DORA research, the top-performing teams are not just the ones deploying frequently.
They are the ones where:
- Failure rates are low
- Recovery does not depend on individuals
- System behavior is consistent under stress
This has been consistently highlighted in the Accelerate State of DevOps Reports published by Google Cloud.
You can explore the latest findings here:
https://cloud.google.com/devops/state-of-devops
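Those benchmark-style metrics are easy to compute from a team's own incident history. A rough Python sketch, using entirely made-up numbers (the incident log format and names here are hypothetical):

```python
from statistics import median

# Hypothetical incident log: (minutes to restore service, who resolved it)
incidents = [(12, "alice"), (45, "alice"), (8, "bob"), (30, "alice")]
deploys = 50  # deployments in the same period

# Share of deploys that caused an incident (DORA's "change failure rate")
change_failure_rate = len(incidents) / deploys

# Typical time to restore service, in minutes
time_to_restore = median(minutes for minutes, _ in incidents)

# How concentrated recovery is on one person: share of incidents
# resolved by the single most-relied-on engineer
by_person = {}
for _, who in incidents:
    by_person[who] = by_person.get(who, 0) + 1
bus_factor_share = max(by_person.values()) / len(incidents)
```

With these numbers, 75% of incidents land on one engineer; that concentration is exactly the signal frequent deploys alone do not capture.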
That last part is where most growing teams struggle.
What is changing in how better teams operate is subtle but important.
They are moving away from thinking of DevOps as a role that “handles infra”
towards treating infrastructure as a system that needs to behave consistently.
This shift is also visible in broader platform engineering and infrastructure trends.
The CNCF Platform Engineering Report highlights how teams are moving toward internal platforms and standardized system behavior instead of relying on individuals.
https://www.cncf.io/reports/
This shows up clearly in how strong teams respond to issues.
When something breaks, the fix is not the end of the work.
The focus shifts to:
- Did the system behave as expected?
- Can this failure happen again in a different form?
- Is recovery defined, or does it depend on someone remembering what to do?
In many teams, fixes stop at resolution.
In stronger teams, fixes continue until behavior is predictable.
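One way to make that last question concrete is to write recovery down as code instead of tribal knowledge. A minimal Python sketch, where the `check` and `action` callables stand in for real health checks and remediation steps (both hypothetical here):

```python
import time

def recover(check, action, retries=3, delay=0.1):
    """Run a defined recovery action until a health check passes.

    The point: the procedure is written down and repeatable, not
    something one engineer remembers under pressure at 2 a.m.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt  # healthy: report how many attempts it took
        action()            # defined recovery step, e.g. restart a service
        time.sleep(delay)
    raise RuntimeError(f"recovery failed after {retries} attempts")

# Example: a fake service that becomes healthy after one recovery action.
state = {"healthy": False}
attempts = recover(
    check=lambda: state["healthy"],
    action=lambda: state.update(healthy=True),
)
```

The same structure works whether `action` restarts a process, rolls back a deploy, or fails over a database; what matters is that it is encoded once and behaves the same every time.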
One internal study from a fintech startup highlighted this clearly.
Over six months, they noticed that while individual issues were different, the same types of failures kept repeating:
- Deployment inconsistencies
- Scaling misbehavior
- Alert noise without clarity
Each was fixed multiple times, but never fully standardized.
Once they started converting these into defined system behaviors instead of one-time fixes, their incident frequency dropped noticeably within a quarter.
Not because fewer things broke,
but because the system stopped reacting differently each time.
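As a rough illustration of what "a defined system behavior" can mean for deployment inconsistencies: a drift check that runs before every deploy and reports where a live environment diverges from its spec, instead of an engineer eyeballing environments after something breaks. The config keys below are invented for the example:

```python
def config_drift(expected, actual):
    """Return the keys where a live environment diverges from its spec."""
    drift = {}
    for key, want in expected.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"expected": want, "actual": have}
    return drift

# Hypothetical spec vs. live environment
expected = {"replicas": 3, "log_level": "info", "timeout_s": 30}
actual = {"replicas": 3, "log_level": "debug"}  # timeout_s never set

drift = config_drift(expected, actual)
```

Gating deploys on an empty `drift` result turns a class of recurring incidents into a check the system performs the same way every time.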
This is becoming a clear divide in how infrastructure is evolving.
- Some teams are optimizing execution
- Others are shaping system behavior
The difference shows up in how stable things feel during growth.
What most founders and CTOs are realizing now is simple:
- Hiring DevOps improves how work gets done
- It does not automatically improve how systems behave
Long-term stability comes from the second, not the first.
That shift is where most of the industry is slowly moving.
Not towards more tooling
Not towards larger teams
But towards systems that are predictable enough
that the team does not have to think about them every time something changes.
Over the last 3 to 5 years, most scaling startups have followed a similar playbook.
