Why does your $100/month alerting tool still need a WhatsApp message to actually work?




Production went down at 11:02pm on a Thursday. The alert fired correctly. It went to Slack, as configured. The first person to notice was a customer, who sent a support ticket at 11:43pm. The engineer saw the Slack notification at 11:47pm, after finishing dinner.

Forty-five minutes. Transactions failing. Customers affected. And somewhere in a monitoring dashboard, a green checkmark showing the alert was successfully delivered.

Almost every CTO I speak to in Bengaluru, Mumbai, or Hyderabad has this story, or a version of it. The monitoring worked. The pipeline worked. The tooling worked exactly as designed. And none of that was enough, because the real gap was never the alert. It was whether someone actually saw it and took ownership in time.


The tool your team is paying for was not built for your team

PagerDuty and OpsGenie were designed for a different market

PagerDuty was founded in 2009 in San Francisco. OpsGenie was built in the US, acquired by Atlassian, and stopped accepting new accounts in mid-2025. Both were designed around a communication reality where phone calls, SMS, and email are the primary escalation channels, with Slack as the dominant modern layer. WhatsApp is either absent or available as a third-party integration, with separate configuration, separate maintenance, and separate cost.

For a startup with a six-person engineering team in India, PagerDuty's Team plan runs roughly $19 per user per month. That is over $100 a month, billed in dollars, for a tool whose default behaviour does not match the channel your engineers will actually respond to fastest. Teams pay for the tool, configure the Slack integration, and then the engineers still end up messaging each other on WhatsApp when something breaks at midnight, because that is simply where they are.

Delivered and acknowledged are not the same thing

This is the distinction most alerting systems quietly ignore. Delivery means the alert was sent. Acknowledgement means a specific human has seen it, taken ownership, and started working on it. Those are two entirely different events, and the gap between them is where incidents become expensive.

Mean Time to Acknowledge, the metric Atlassian defines as the average time between an alert firing and a human beginning to work on it, is the number that determines how bad a production incident actually gets. A system that delivers alerts reliably but cannot guarantee acknowledgement is a system that tells you something broke. It does not ensure anyone is fixing it.

Slack showed the alert arrived at 11:02. The engineer acknowledged at 11:47. The monitoring system logged a successful delivery. In those 45 minutes, nobody was accountable, and nothing in the official tooling noticed the silence.


Why WhatsApp is not a preference. It is where attention lives in India.

The numbers behind the behaviour

94% of WhatsApp's Indian monthly user base opened the app daily as of late 2025, according to Sensor Tower data. Not weekly. Not when they get around to it. Daily, multiple times, across every context: personal, professional, family. A message arriving on WhatsApp does not compete with 47 Slack notifications and three PR review requests. It arrives in a space where the reflex to check and respond is fully trained.

This is not about WhatsApp being popular. It is about where attention is reliably concentrated. When production breaks at midnight and a response is needed in under five minutes, the channel with the highest probability of being seen immediately is not the one with the most enterprise features. It is the one where the engineer already is.

The real incident response system is already running. It is just not official.

Ask almost any CTO at an Indian startup how incidents actually get resolved at night. The honest answer almost always involves someone messaging the on-call engineer on WhatsApp, calling if there is no reply, and escalating manually from there. This is not a gap in the system. It is the actual system, running informally in parallel to the official one.

It works, until it does not. It breaks when the person willing to escalate manually is unavailable, asleep, or simply does not notice the silence quickly enough. There is no defined path, no audit trail, and no consistency. The outcome of any given midnight incident depends heavily on who happens to be awake and paying attention.

The informal WhatsApp chain that resolves most midnight incidents is not a workaround. It is the real system. The question is whether it should keep depending on whoever is willing to stay awake, or whether it should be designed to work regardless.


What a WhatsApp-first alerting system actually looks like

The architecture that changes the outcome

Every alerting tool available today was built with Slack, SMS, or email as the primary channel, and WhatsApp added later as an integration if at all. When WhatsApp is a plugin, it gets delivery. The acknowledgement logic, the escalation engine, and the response confirmation are all still designed around the original primary channel.

What is different about an alerting system built with WhatsApp as the foundation rather than an afterthought is where the intelligence lives. The engineer receives the alert on WhatsApp. They acknowledge directly from that interaction, no separate app, no dashboard login, no context switch. The system registers ownership the moment acknowledgement arrives. Until that moment, it does not wait for someone to notice the silence. It escalates automatically.
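A rough sketch of that distinction in code may help; the names and structure below are illustrative assumptions, not a description of Shankh's internals. The point is simply that delivery and ownership are separate states, and nothing counts as handled until a named engineer acknowledges.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AlertState(Enum):
    FIRED = "fired"                # monitoring detected the problem
    DELIVERED = "delivered"        # the message reached WhatsApp
    ACKNOWLEDGED = "acknowledged"  # a named human owns it


@dataclass
class Alert:
    id: str
    summary: str
    state: AlertState = AlertState.FIRED
    owner: Optional[str] = None
    acknowledged_at: Optional[datetime] = None

    def acknowledge(self, engineer: str) -> None:
        # Ownership is registered the moment the ack arrives from WhatsApp.
        # Until then the alert is only "delivered", never "handled".
        self.state = AlertState.ACKNOWLEDGED
        self.owner = engineer
        self.acknowledged_at = datetime.now(timezone.utc)
```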

Escalation that does not require a human to notice

If the primary on-call engineer does not acknowledge within a defined window, the escalation happens without anyone deciding to trigger it. The path moves to the secondary, then to the team lead, then to the CTO, on a timer, at 2am on a Sunday, regardless of who is awake or whether anyone has noticed the original message went unanswered.

This is the part of informal WhatsApp workflows that breaks down most consistently. The escalation depends on a person noticing the silence and choosing to act. A properly designed system replaces that dependency with a timer. The outcome stops varying based on who happens to be paying attention.
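A minimal sketch of that timer, under stated assumptions: the escalation chain and acknowledgement window are made up, and `send_whatsapp_alert` and `is_acknowledged` are placeholders for whatever delivery and state-tracking the real system uses.

```python
import time

# Hypothetical escalation chain and per-step acknowledgement window.
ESCALATION_CHAIN = ["primary_oncall", "secondary_oncall", "team_lead", "cto"]
ACK_WINDOW_SECONDS = 5 * 60


def send_whatsapp_alert(person: str, alert_id: str) -> None:
    """Placeholder: deliver the alert to `person` on WhatsApp."""
    print(f"alert {alert_id} -> {person}")


def is_acknowledged(alert_id: str) -> bool:
    """Placeholder: has any human acknowledged this alert yet?"""
    return False


def escalate(alert_id: str) -> None:
    # Walk the chain on a clock. No human decides to escalate;
    # the absence of an acknowledgement is enough to move on.
    for person in ESCALATION_CHAIN:
        send_whatsapp_alert(person, alert_id)
        deadline = time.monotonic() + ACK_WINDOW_SECONDS
        while time.monotonic() < deadline:
            if is_acknowledged(alert_id):
                return  # someone owns it; stop escalating
            time.sleep(10)
    # Chain exhausted with no acknowledgement: surface this loudly.
    print(f"alert {alert_id} unacknowledged after full escalation chain")
```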

It sits on top of what the team already uses

This layer does not replace Prometheus, Grafana, or Datadog. It connects to them via webhook, routes alerts by severity, and delivers through WhatsApp with the escalation logic on top. The monitoring investment the team has already made continues to function as intended. What gets added is the guarantee that a human responds, which is the part those tools were never designed to provide.
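In practice that usually means pointing an Alertmanager or Grafana contact point at an HTTP endpoint. A simplified sketch of such a receiver follows, assuming the payload carries a `severity` label as these tools commonly send; the routing table and `start_escalation` are hypothetical.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical mapping from severity to escalation behaviour.
SEVERITY_ROUTES = {
    "critical": {"chain": ["primary_oncall", "team_lead", "cto"], "ack_window_min": 5},
    "warning":  {"chain": ["primary_oncall"],                     "ack_window_min": 30},
}


def start_escalation(alert: dict, route: dict) -> None:
    """Placeholder: deliver on WhatsApp and start the escalation timer."""
    name = alert.get("labels", {}).get("alertname", "unknown")
    print(f"routing {name} via {route['chain']}")


@app.route("/alerts", methods=["POST"])
def receive_alert():
    # Prometheus Alertmanager and Grafana both POST JSON to a webhook URL;
    # the payload shape assumed here is deliberately simplified.
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        severity = alert.get("labels", {}).get("severity", "warning")
        route = SEVERITY_ROUTES.get(severity, SEVERITY_ROUTES["warning"])
        start_escalation(alert, route)
    return "", 204
```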


How to know whether your current setup has this gap

One diagnostic that takes five minutes

Look at the last five production alerts that fired outside business hours. For each one, find the time the alert fired and the time the first engineer began actively working on the incident. That gap is the real Mean Time to Acknowledge for the team. If it is consistently above ten minutes, the delivery layer is working and the acknowledgement layer is not. The monitoring system did its job. The gap between the alert and the human response is where the problem lives.
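Once the ten timestamps are pulled out of the monitoring history and the incident channel, the arithmetic is trivial. A quick sketch, with timestamps that are made up for illustration:

```python
from datetime import datetime

# Fired time vs. first human action, for the last five out-of-hours alerts
# (these timestamps are illustrative, not real incident data).
incidents = [
    ("2025-11-06 23:02", "2025-11-06 23:47"),
    ("2025-11-12 01:15", "2025-11-12 01:22"),
    ("2025-11-19 22:40", "2025-11-19 23:05"),
    ("2025-11-27 03:10", "2025-11-27 03:55"),
    ("2025-12-02 00:30", "2025-12-02 00:41"),
]

fmt = "%Y-%m-%d %H:%M"
gaps = [
    (datetime.strptime(ack, fmt) - datetime.strptime(fired, fmt)).total_seconds() / 60
    for fired, ack in incidents
]
print(f"MTTA over the last 5 out-of-hours alerts: {sum(gaps) / len(gaps):.0f} minutes")
```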

The fix is not more alerts or louder notifications. It is an alerting system designed around the channel where the team's attention is actually concentrated, with escalation that runs on a clock rather than on someone's awareness. For most Indian engineering teams, that means WhatsApp at the centre, not the periphery.


This is what we built Shankh to solve. Not a new monitoring stack, not another dashboard, not a replacement for what the team already has. A layer that sits between the alerts that already fire and the engineers who need to respond to them, designed specifically for the way Indian teams actually communicate. The informal system that already exists in every team's chat history, made consistent, auditable, and reliable enough to trust at midnight. If your team is already solving this manually on WhatsApp, the question worth asking is whether that should keep depending on goodwill or whether it should be built to work every time.

Ayesha Siddiqua

Working with startup teams across India, the pattern I keep seeing is the same: the monitoring is set up, the alerts are configured, and the real incident response happens over WhatsApp in a group that has no escalation path and no audit trail. At Frigga Cloud Labs, we built Shankh specifically because this gap is structural, not a habit problem. If your 11pm incident response depends on who is awake, it is worth looking at what a system that removes that dependency would look like.

Let's connect on LinkedIn



