Observability vs. monitoring: why startups get this wrong, and what it costs them when they do.


Here is a situation most engineering teams have lived through. An alert fires: API error rate is above 5%. The on-call engineer opens the dashboard. Error rate is elevated, confirmed. They look at CPU, memory, and database connection counts. All normal. They check the deployment log. Nothing shipped in the last six hours. The alert is real. The cause is invisible. The engineer spends forty minutes grepping through logs from three different services before finding the culprit: a third-party payment provider started returning a new error code that none of the existing log parsers recognised, so the errors were being silently swallowed by the service layer and surfacing only as a generic 500 upstream.

The monitoring system did its job. The alert fired at the right threshold. What it could not do was answer the only question that actually mattered in that moment: why is this happening, and in exactly which service, on exactly which request path, starting from exactly when?

That is the gap between monitoring and observability. It sounds like a semantic distinction. In production, it is the difference between a fifteen-minute resolution and a two-hour postmortem.


What monitoring actually is, and what it cannot do

Monitoring is the practice of watching for known problems

Monitoring works by defining what normal looks like and alerting when that definition is violated. CPU above 80%. Error rate above 2%. Response time above 500 milliseconds. These are thresholds set by humans based on their current understanding of the system. When a metric crosses a threshold, the system fires. When everything is within bounds, the system is quiet.
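The threshold model can be sketched in a few lines of plain Python. The metric names and limits below are illustrative, not any particular tool's API:

```python
# A minimal sketch of threshold-based monitoring: compare each metric
# sample against a human-defined limit and fire when it is exceeded.
# Metric names and thresholds are illustrative, not from any real tool.

THRESHOLDS = {
    "cpu_percent": 80.0,
    "error_rate_percent": 2.0,
    "p99_latency_ms": 500.0,
}

def check(sample: dict) -> list[str]:
    """Return the list of alerts fired for one metrics sample."""
    return [
        f"ALERT: {name} = {sample[name]} exceeds {limit}"
        for name, limit in THRESHOLDS.items()
        if sample.get(name, 0.0) > limit
    ]

# Only rules a human thought to write can ever fire. A failure mode
# with no configured threshold produces silence, not an alert.
print(check({"cpu_percent": 45.0, "error_rate_percent": 5.1, "p99_latency_ms": 120.0}))
```

The structural limitation is visible in the code itself: `check` can only ever report violations of rules someone anticipated in advance.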

This model is powerful for the problems it was designed to detect: infrastructure saturation, known failure modes, predictable degradation patterns. It is completely blind to what OpenTelemetry's observability primer describes as unknown unknowns: problems that were not anticipated when the thresholds were set, and for which no threshold therefore exists. The payment provider returning a new error code was an unknown unknown. No threshold covered it. No alert was configured for it. The monitoring system had no way of knowing it was the problem until a human figured it out manually.

Most startups have monitoring. Almost none have observability.

Setting up Prometheus to scrape metrics and Grafana to display them is monitoring. Setting up CloudWatch alarms on Lambda error rates is monitoring. Configuring Datadog to alert on CPU is monitoring. All of these are genuinely valuable and worth having. None of them, individually or together, constitute observability. What they share is that they all answer a single class of question: is this metric within the expected range? Observability answers a different class of question entirely: why is the system behaving the way it is right now?

Splunk's State of Observability 2025 report, based on 1,855 engineering professionals across nine countries, found that 73% of organizations had experienced outages directly linked to ignored or suppressed alerts. Those outages do not describe a monitoring failure. They describe an observability failure: too much noise, too little context, and no way to correlate a firing alert with the specific request path and service interaction that caused it.


The three pillars, explained plainly

Metrics: numbers over time

Metrics are numeric measurements collected at regular intervals. Request rate, error rate, latency, CPU utilization, memory usage, queue depth. They are efficient to store, fast to query, and well-suited to answering trend questions: is this getting worse? Is this above normal? Metrics are the foundation of monitoring, and they are genuinely useful. Their limitation is that they tell you something changed, but not why it changed or where in the system the change originated.

Logs: events with context

Logs are timestamped records of things that happened. A user logged in. A database query failed. A function threw an exception. Logs contain the context that metrics lack: the specific error message, the user ID, the query that failed, the stack trace. The problem with logs is that they do not connect to each other. A log from Service A and a log from Service B might be part of the same user request, but without a shared identifier threading them together, you cannot see the relationship. You see two events in isolation, not a story.
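The difference a shared identifier makes can be shown with two structured log events. The field names here are illustrative, not a specific logging library's schema:

```python
import json

# Two services emit structured logs for the same user request.
# Without a shared identifier they are unrelated events; with a
# trace_id field threading through both, they become one story.
# Field names are illustrative, not a real logging library's schema.

def log_event(service, message, trace_id=None):
    record = {"service": service, "message": message}
    if trace_id is not None:
        record["trace_id"] = trace_id
    return json.dumps(record)

# Same request, so the same trace_id appears in both services:
a = log_event("api-gateway", "request received", trace_id="4bf92f35")
b = log_event("payment-svc", "provider returned unknown code", trace_id="4bf92f35")

# Now the two events can be joined on trace_id instead of guesswork.
same_request = json.loads(a)["trace_id"] == json.loads(b)["trace_id"]
print(same_request)  # True
```

Without the `trace_id` field, the only way to connect the two events is timestamp proximity and intuition, which is exactly the forty-minute grep session from the opening example.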

Traces: the full path of a request

This is the pillar most startups are missing entirely. A trace records the complete journey of a single request as it moves through your system: from the API gateway to the auth service to the database query and back. Each step in that journey is a span, and spans are stitched together by a shared trace ID that flows through every service the request touches. When API response times degrade, a trace shows you not just that they degraded, but exactly which service call added the latency, on which specific request pattern, starting from which point in time.
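A trace holds together because every service forwards the same trace ID. The W3C Trace Context standard carries it in a `traceparent` HTTP header; here is a stdlib-only sketch of minting and propagating one, with no OpenTelemetry dependency:

```python
import re
import secrets

# Sketch of W3C Trace Context propagation: the first service mints a
# trace_id, each hop mints its own span_id, and both travel in a
# "traceparent" header so every span can be stitched to one trace.

def new_traceparent() -> str:
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by all spans
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent: str) -> str:
    """Forward the trace_id unchanged; mint a fresh span_id for this hop."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
downstream = child_traceparent(root)

# Both headers carry the same trace_id, so a backend can join the spans.
assert root.split("-")[1] == downstream.split("-")[1]
assert re.fullmatch(r"00-[0-9a-f]{32}-[0-9a-f]{16}-01", downstream)
```

This is what "trace context propagation" means concretely: the header must be forwarded on every outbound call, which is why retrofitting it into services built without it is expensive.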

As Dash0's explanation of observability signals puts it, a log entry on its own tells you something failed. That same log, linked to the trace that triggered it, shows you the exact request and execution path involved. When the metrics from that request are also connected, you can see how often the problem occurs, how severe it is, and whether it is isolated or systemic. Individually, each pillar is useful. Together, with shared context threading them, they are a completely different capability.

Monitoring tells you when something broke. Observability tells you why, where, and since when. Most startups have the first. Almost none have invested in the second until after their first serious incident.


OpenTelemetry: why it matters and what it solves

The problem before OpenTelemetry existed

Before OpenTelemetry, instrumenting an application for observability meant choosing a vendor and using their proprietary SDK. Datadog had its own tracing library. New Relic had its own agent. Honeycomb had its own SDK. If you instrumented your application for Datadog and then wanted to evaluate Honeycomb, you had to re-instrument everything. Every vendor change meant weeks of engineering work, and teams ended up locked into whatever they chose first because the cost of switching was too high.

What OpenTelemetry changes

OpenTelemetry is an open-source, vendor-neutral framework for generating, collecting, and exporting telemetry data. It provides SDKs for over twelve languages, automatic instrumentation for common frameworks and libraries, and a Collector that receives telemetry from your services and routes it to any backend. You instrument your application once, using OpenTelemetry APIs. You then point the exporter at whatever backend you choose: Grafana Cloud, Datadog, Honeycomb, Jaeger, or your own infrastructure. Switching backends becomes a configuration change, not an engineering project.
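The routing lives in the Collector's configuration, not in application code. A hedged sketch of a Collector config that receives OTLP from instrumented services and forwards it to a backend (the endpoint is a placeholder, and the exporter choice depends on the backend you pick):

```yaml
# Sketch of an OpenTelemetry Collector pipeline: receive OTLP from
# instrumented services, batch, and export. Swapping backends means
# editing the exporters section, not re-instrumenting any service.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: https://example-backend.invalid/otlp  # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

This is the mechanism behind "switching backends becomes a configuration change": the application only ever speaks OTLP to the Collector, and the Collector decides where the data goes.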

As of 2026, OpenTelemetry is the second most active CNCF project after Kubernetes, supported by every major observability vendor. The practical implication for a startup is straightforward: if you are not yet instrumenting for observability, start with OpenTelemetry. You will not be locked into any vendor, and you will not have to re-instrument when you outgrow your first backend choice.


Why observability is a product decision, not just an ops decision

The engineering team's visibility problem is also a product team's roadmap problem

The most important insight about observability that most engineering teams miss is that it answers business questions, not just technical ones. Which feature has the highest error rate in production? Which API endpoint has the worst p99 latency and is it correlated with a specific user cohort? When did checkout success rates start declining, and which service change coincided with it? These are questions that product managers and engineering managers need answered to make roadmap decisions. Without traces and correlated logs, those questions either go unanswered or get answered weeks late in a postmortem.

Splunk's 2025 research found that 65% of engineering organizations say their observability practice positively impacts revenue, and 64% say it positively impacts product roadmaps. Those numbers are not describing ops teams keeping the lights on. They are describing engineering teams using production telemetry to make product decisions faster and with more confidence.

Observability changes when you can ship with confidence

A team with proper observability can deploy on a Friday afternoon because they can see in real time whether the new code is behaving differently from the old code. Error rates, latency distributions, and trace patterns for the new version appear immediately after deployment. If something is wrong, the trace shows exactly what is wrong and in which service. The rollback decision is made on evidence, not anxiety.

A team without observability deploys on Tuesday morning, with senior engineers available, with manual checks in place, with a ceremony built around deployment anxiety. That ceremony is not caution. It is the operational cost of not being able to see what your system is actually doing.

The cost of deferring this is higher than it looks

Most startup engineering teams defer observability with the same logic as everything else: we will do it properly once we are bigger. The compounding problem is that every service added to a system without proper instrumentation becomes a permanent blind spot. Retrofitting distributed tracing into a system of twelve services that were built without trace context propagation is a weeks-long engineering project. Building it in from the start, using OpenTelemetry's automatic instrumentation for standard frameworks, takes an afternoon. The asymmetry between the cost of doing it early and the cost of doing it late is significant, and it only widens as the system grows.
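For a Python service, "takes an afternoon" looks roughly like this. These are the standard OpenTelemetry distro commands; `app.py` and the endpoint are placeholders for your own service and backend:

```shell
# Install the OpenTelemetry distro and OTLP exporter,
# then detect installed frameworks and add their instrumentation.
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the service under automatic instrumentation; no code changes.
# Endpoint and service name are placeholders for your own setup.
export OTEL_SERVICE_NAME=checkout-api
export OTEL_EXPORTER_OTLP_ENDPOINT=https://example-backend.invalid
opentelemetry-instrument python app.py
```

Automatic instrumentation covers common frameworks and clients out of the box; custom spans for business logic can be added incrementally afterwards.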


The minimum viable observability stack for a startup in 2026 is not complex or expensive. OpenTelemetry for instrumentation. Grafana Cloud's permanent free tier for metrics, logs, and traces. That combination covers all three pillars, costs nothing at early stage, and leaves the door open to any backend choice as the team grows. The argument for doing this before you think you need it is simple: the first time you need a trace to debug a production issue and you do not have one, you will spend hours doing manually what a trace would have shown in thirty seconds. That incident is coming. The only question is whether your system will be ready to answer it.

Ayesha Siddiqua

I sit at the crossroads of cloud infrastructure and startup growth, and over time, that has put me in a lot of honest conversations with Heads of Engineering who were convinced their system was observable because the dashboards were green, right up until the moment a customer found the bug before they did. I am part of the team at Frigga Cloud Labs, a DevOps consultancy built specifically for growing startups. If your team is spending hours in a postmortem tracing a problem that a single distributed trace would have surfaced in thirty seconds, that gap is worth closing before the next incident.
 Let's connect on LinkedIn
