Monitoring and observability are not the same thing. Here is what the difference actually looks like in a running system.


Managing infrastructure across AWS, GCP, and Azure means you are dealing with systems that generate telemetry from a lot of different directions at once. Metrics from CloudWatch, logs from GCP's Cloud Logging, traces from whatever the application team happened to instrument, or did not. Keeping that coherent is the actual day-to-day problem. Monitoring tools solve part of it. They watch defined thresholds and fire when something crosses them. What they do not do is give you the context to understand why something crossed, especially when the failure mode was not one you anticipated when you wrote the alert rule.

That distinction between monitoring and observability sounds like semantics until you are looking at an elevated error rate with no corresponding CPU, memory, or deployment signal to explain it. Monitoring tells you the rate is elevated. Observability tells you which specific request path, through which service, starting when. Those are different answers to a different class of question, and getting to the second one requires instrumentation that most teams have not set up.


What monitoring actually covers and where it stops

Prometheus scraping metrics, Grafana dashboards, CloudWatch alarms, Datadog infrastructure monitors: all of these are monitoring. They watch known quantities against defined thresholds. They are genuinely useful and worth having. The limit is that they can only catch problems you anticipated when you wrote the rules. A metric that has no rule does not fire. A failure mode that does not manifest as a threshold violation does not surface.
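To make that limit concrete, here is a minimal sketch of threshold-based alerting. The ThresholdRule class, the firing_alerts function, and the metric names are hypothetical and not taken from any of the tools above; the point is only that a metric without a rule can never fire.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    """One alert rule: fire when `metric` rises above `threshold`."""
    metric: str
    threshold: float


def firing_alerts(rules: List[ThresholdRule], samples: Dict[str, float]) -> List[str]:
    """Return the metrics whose latest sample exceeds the rule written for them."""
    return [
        rule.metric
        for rule in rules
        if rule.metric in samples and samples[rule.metric] > rule.threshold
    ]


# Only CPU has a rule. The error rate is clearly unhealthy, but nothing watches it.
rules = [ThresholdRule("cpu_percent", 90.0)]
latest = {"cpu_percent": 41.0, "error_rate": 0.35}
print(firing_alerts(rules, latest))
```

The elevated error rate in the sample data goes unreported because nobody anticipated it when the rules were written.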

Observability is about being able to ask arbitrary questions of a running system and get answers from the data it emits. Not just "is this metric above X" but "which downstream call is responsible for the latency increase on this specific endpoint for this specific request pattern." That requires all three signal types, and it requires them to be connected through a shared context so you can move between them during an investigation without losing the thread.

The three signals and why they need to be connected

Metrics

Metrics are numeric time-series data. Request rate, error rate, latency percentiles, CPU, memory, queue depth. Fast to store, fast to query, well-suited for dashboards and alerting. The limitation is that a metric tells you a value changed but not what caused it. A p99 latency spike tells you requests are slow. It does not tell you which downstream service is adding the latency or why.
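A small sketch of why the aggregate hides the cause, using only the Python standard library. The sample data and the dependency names are invented for illustration.

```python
from statistics import median, quantiles


def p99(latencies_ms):
    """99th percentile, using the standard library's default quantile method."""
    return quantiles(latencies_ms, n=100)[98]


# Hypothetical samples: (dependency that served the request, latency in ms).
requests = [("cache", 50.0)] * 980 + [("database", 900.0)] * 20

# What a latency metric stores: the numbers, with the per-request context dropped.
latencies = [ms for _, ms in requests]
print("median:", median(latencies), "p99:", p99(latencies))

# What the metric cannot answer once that context is gone.
slow_dependencies = {dep for dep, ms in requests if ms > 500}
print("slow requests came from:", slow_dependencies)
```

The percentile is computed from the latency values alone, so the question "which dependency was slow" can only be answered from data that kept the per-request context, which is what traces and logs are for.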

Logs

Logs are timestamped event records. They carry the context metrics do not: the specific error message, the stack trace, the user ID, the database query that timed out. The problem with logs in isolation is correlation. A log line from your API service and a log line from your database service might be part of the same request, but without a shared trace ID threading them together, you are doing manual grepping across services hoping to match timestamps. That is how investigations turn into multi-hour efforts.
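One way to avoid the timestamp matching is to put the trace ID on every log line as a structured field. The sketch below uses only the standard logging module. The JsonTraceFormatter class and the field names are assumptions for illustration, and in a real service the trace ID would come from the active span rather than being passed by hand.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries a trace_id field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            # `service` and `trace_id` are expected to arrive via `extra=`.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonTraceFormatter())
logger = logging.getLogger("orders-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Hypothetical trace ID, passed explicitly for the sake of the example.
logger.error(
    "database query timed out",
    extra={"service": "orders-api", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```

With the same field emitted by every service, finding all the log lines for one request becomes a single filter on trace_id.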

Traces

Traces record the full execution path of a request across every service it touches. Each service adds a span, which is a named and timed unit of work. All spans from the same request share a trace ID that propagates through every service boundary in HTTP headers (OpenTelemetry defaults to the W3C traceparent header). When you open a trace from Tempo in Grafana, you see a waterfall of every operation that ran for that request, with exact timing breakdowns. That is what turns a "something is slow" alert into a "this specific database query on this specific endpoint is adding 340ms when the cache misses" answer.
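For a sense of what actually crosses the service boundary, here is a simplified sketch of the W3C Trace Context traceparent header: version, 32-hex trace ID, 16-hex parent span ID, and flags, joined by hyphens. The OpenTelemetry SDKs handle this for you; the function names below are hypothetical and the parser only covers the version 00 layout.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(ctx)
print(child_traceparent(ctx))
```

The trace ID is the part that never changes as the request moves between services, which is why every span and log line can be tied back to it.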

When all three signals share context, a metric anomaly has an exemplar that links to a trace, and that trace links to the log lines that were emitted during it. In Grafana with the LGTM stack configured correctly, that is a three-click path from a metric spike to the log line that explains it. Without that connection, each signal sits in isolation and the investigation is manual.
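The same path can be sketched over in-memory data. This is not how Grafana implements it; explain_trace and the dictionary fields are hypothetical. It only shows that once spans and logs carry the same trace and span IDs, getting from a suspicious trace ID to its explanation is a pair of lookups.

```python
def explain_trace(trace_id, spans, logs):
    """Return the slowest leaf span of a trace and the log messages emitted inside it."""
    trace_spans = [s for s in spans if s["trace_id"] == trace_id]
    if not trace_spans:
        return None, []
    # Leaf spans have no children. Parent spans include their children's time,
    # so comparing leaves points at where the time was actually spent.
    parent_ids = {s["parent_id"] for s in trace_spans}
    leaves = [s for s in trace_spans if s["span_id"] not in parent_ids]
    slowest = max(leaves, key=lambda s: s["duration_ms"])
    messages = [
        entry["message"]
        for entry in logs
        if entry["trace_id"] == trace_id and entry["span_id"] == slowest["span_id"]
    ]
    return slowest, messages


# Hypothetical data: one request that hit the cache, missed, and queried the database.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /checkout", "duration_ms": 410},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "cache.get", "duration_ms": 5},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "db.query", "duration_ms": 340},
]
logs = [
    {"trace_id": "t1", "span_id": "b", "message": "cache miss for cart"},
    {"trace_id": "t1", "span_id": "c", "message": "orders query ran without index"},
]

slowest, messages = explain_trace("t1", spans, logs)
print(slowest["name"], messages)
```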


OpenTelemetry: why we use it and what it actually gives you

Before OpenTelemetry, instrumenting an application for traces meant using the vendor's SDK. Datadog's tracing library, New Relic's agent, Honeycomb's SDK. Switching backends meant re-instrumenting every service. That is why teams stayed locked into their first choice regardless of whether it was the right one long-term.

OpenTelemetry is a vendor-neutral instrumentation standard. You instrument your application once using OTel SDKs, emit OTLP to a Collector, and the Collector routes to whatever backends you are using. Switching from Grafana Cloud to Datadog is a Collector config change, not a code change. On a multi-cloud setup spanning AWS, GCP, and Azure, that portability matters: you are not coupling your instrumentation to a single cloud's native observability tooling.

Auto-instrumentation covers the framework layer without code changes. For Node.js services, @opentelemetry/auto-instrumentations-node loaded before the application starts will automatically generate spans for HTTP servers, Express routes, database clients, and outbound HTTP calls. For Python, the opentelemetry-instrument command wraps the application process and does the same. For Java, the agent attaches at JVM startup via -javaagent. That gets you trace coverage for framework-level operations in the supported libraries immediately.

For business-critical operations that live inside your own code, you add manual spans. Payment processing, cache lookups, async jobs, anything where you need visibility into what happened inside a specific function:

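from opentelemetry import trace

# Tracer named after this module, so spans can be traced back to their source.
tracer = trace.get_tracer(__name__)

def process_payment(payment_id: str, amount: float):
    # payment_gateway is the application's own client, defined elsewhere.
    with tracer.start_as_current_span("payment.process") as span:
        # Attributes make the span searchable by payment and amount.
        span.set_attribute("payment.id", payment_id)
        span.set_attribute("payment.amount", amount)
        result = payment_gateway.charge(payment_id, amount)
        span.set_attribute("payment.status", result.status)
        return result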

The span attributes are what make the trace useful during an investigation. Without them, you know a function ran and how long it took. With them, you know which payment ID, which amount, and what the gateway returned. That context is what surfaces in Tempo and links back to the log lines Loki collected during the same trace.


The Collector config that routes each signal to the right backend

The OTel Collector sits between your services and your backends. It receives OTLP on port 4317 over gRPC or 4318 over HTTP, processes through a configurable pipeline, and exports to one or more destinations. Running a Collector gives you batching, filtering, the ability to fan out to multiple backends, and the flexibility to swap backends without touching application code.

For a setup routing to Prometheus for metrics, Tempo for traces, and Loki for logs, this is the core Collector config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    # check_interval must be set to a positive duration.
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/tempo:
    endpoint: http://tempo:4318
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

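The memory_limiter processor is non-negotiable in production, and it should stay first in each pipeline so it can refuse data before other processors buffer it. Without it, a telemetry spike can OOM the Collector. The batch processor significantly reduces the number of network calls, which matters most for log and trace pipelines at higher request volumes. Two things to check against your versions: Prometheus only accepts pushes on /api/v1/write when it is started with --web.enable-remote-write-receiver, and recent Collector releases have deprecated the loki exporter in favour of sending OTLP straight to Loki through an otlphttp exporter. In Kubernetes, run the Collector as a DaemonSet so each node has its own instance collecting from all pods on that node.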
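To show why batching cuts network calls, here is a rough sketch of size-or-timeout batching, the idea behind the send_batch_size and timeout settings. It is not the Collector's implementation; the SizeOrTimeBatcher class and its methods are hypothetical.

```python
import time
from typing import Callable, List, Optional


class SizeOrTimeBatcher:
    """Hold items until the batch is full or the oldest item has waited too long."""

    def __init__(self, send_batch_size: int = 1024, timeout_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock
        self._items: List[object] = []
        self._first_at: Optional[float] = None

    def add(self, item: object) -> Optional[List[object]]:
        """Queue an item; return a full batch if this item completed one."""
        if not self._items:
            self._first_at = self._clock()
        self._items.append(item)
        if len(self._items) >= self._size:
            return self._flush()
        return None

    def poll(self) -> Optional[List[object]]:
        """Call periodically; return a partial batch once the timeout has passed."""
        if self._items and self._clock() - self._first_at >= self._timeout:
            return self._flush()
        return None

    def _flush(self) -> List[object]:
        batch, self._items = self._items, []
        self._first_at = None
        return batch


# 2,500 spans leave as two full batches; the remainder waits for the timeout.
batcher = SizeOrTimeBatcher(send_batch_size=1024, timeout_s=5.0)
sent = [b for b in (batcher.add(i) for i in range(2500)) if b is not None]
print(len(sent), "network calls so far instead of 2500")
```

The timeout bounds how stale telemetry can get when traffic is low, and the size limit bounds how large a single export request can grow when traffic is high.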


Connecting the signals inside Grafana

The LGTM stack (Loki, Grafana, Tempo, and Mimir, with Prometheus often standing in for Mimir) is the standard open-source combination. Grafana Cloud's permanent free tier covers 10,000 active metric series, 50GB of logs, and 50GB of traces per month, which is meaningful coverage at an early stage with no operational overhead for the backends.

The correlation between signals depends on Grafana datasource configuration. Exemplars attach trace IDs to specific metric data points. When Prometheus shows a latency spike, clicking the exemplar opens the trace in Tempo directly. From the trace, the tracesToLogs setting on the Tempo datasource lets you jump to the log lines emitted during that specific trace, and derived fields on the Loki datasource link a log line back to its trace. The provisioning config that connects these:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    # Each uid must match the datasourceUid values that point at it.
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
        filterByTraceID: true

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: trace_id
          matcherType: label
          matcherRegex: trace_id
          # $$ escapes the dollar sign so provisioning does not expand it as an environment variable.
          url: '$${__value.raw}'
          datasourceUid: tempo

With this in place, the path from a metric anomaly to a trace to a log line is three clicks. That is the actual value of having all three signals connected rather than sitting in separate tools.
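For reference, this is roughly what an exemplar looks like on the wire in the OpenMetrics text format: the usual sample, then a comment marker, a label set holding the trace ID, and the observed value. The exemplar_line helper and the metric name are hypothetical, and the sketch assumes label values that need no escaping.

```python
def exemplar_line(metric: str, labels: dict, value: float,
                  trace_id: str, observed: float) -> str:
    """Format one sample with an exemplar in OpenMetrics text style."""
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    # Sample first, then '#', then the exemplar's own labels and observed value.
    return f'{metric}{{{label_str}}} {value} # {{trace_id="{trace_id}"}} {observed}'


line = exemplar_line(
    "http_request_duration_seconds_bucket",
    {"le": "0.5"},
    42,
    "4bf92f3577b34da6a3ce929d0e0e4736",
    0.43,
)
print(line)
```

The trace_id label in the exemplar is the name that exemplarTraceIdDestinations refers to in the Prometheus datasource config, which is how Grafana knows what to hand to Tempo.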


Why starting with this from day one matters

Adding OTel instrumentation to a new service takes a few hours. Adding it to twelve services that were built without trace context propagation takes weeks, and the retrofit work competes with everything else on the roadmap. Every service built without instrumentation is a gap in the trace chain: a request goes in, something happens, and the trace has a hole where that service should be. During an incident, that hole is exactly where the investigation stalls.

The instrumentation does not need to be complete on day one. Auto-instrumentation for framework-level traces plus manual spans on the five or six operations that matter most to the business is enough to make a meaningful difference. The Collector and backends can be Grafana Cloud's free tier to start. The point is to build the habit into the development process rather than treating it as a future project.

Author note

Ayesha Siddiqua & Manjunaathaa

Manjunaathaa is an Associate DevOps Engineer at Frigga Cloud Labs. He manages infrastructure across AWS, GCP, and Azure, deploys through GitHub Actions, and has spent a significant amount of time working through exactly the problem this blog is about: the gap between having monitoring set up and actually being able to answer why something broke. His focus is Proactive Resilience, building observability feedback loops that improve the infrastructure itself, not just watch it. The LGTM stack with OpenTelemetry is what he works with daily to make that real. I work with founding teams and CTOs through Frigga Cloud Labs, a DevOps consultancy built specifically for growing startups, and the technical depth in this blog reflects Manju's direct experience working inside these systems.


