
Observability Is Not Just Monitoring With Extra Steps

Nov 8, 2025
observability · monitoring · sre

You have Grafana dashboards covering every service. Prometheus is scraping metrics from 200 endpoints. Your ELK stack ingests terabytes of logs daily. PagerDuty is configured with escalation policies. You have a monitoring stack. You do not have observability.

This is not pedantry. The distinction between monitoring and observability has real implications for how you debug production issues, how fast you recover from incidents, and how well you understand the systems you operate.

The Fundamental Difference

Monitoring tells you when something is broken. It answers known questions: Is the CPU above 80%? Is the error rate above threshold? Is the response time degrading? These are questions you anticipated and built dashboards for.

Observability tells you why something is broken. It lets you ask arbitrary questions about your system's behavior — questions you did not anticipate when you built the dashboards. It is the difference between a check engine light and a mechanic's diagnostic computer.

Here is the practical test: when you get paged at 3 AM for a new, never-before-seen issue, can you diagnose it from your current tooling without deploying additional instrumentation? If the answer is no, you have monitoring but not observability.

The Three Pillars (And Why They Are Not Enough)

You have probably heard observability described as three pillars: metrics, logs, and traces. This framing is useful but incomplete.

Metrics: The What

Metrics tell you what is happening at an aggregate level. They are cheap to store, fast to query, and excellent for alerting and dashboards.

# Good: SLI-based metrics that answer business questions
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Better: With meaningful dimensions
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))

Where metrics fall short: They are pre-aggregated. You decided at instrumentation time what dimensions to track. If a problem manifests across an unexpected combination of dimensions — say, a specific user agent on a specific API version in a specific region — your metrics might not have the cardinality to reveal it.
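To make the cardinality problem concrete, here is a sketch (TypeScript, hypothetical names, purely for illustration) of why pre-aggregation loses dimensions while a wide structured event preserves them:

```typescript
// Pre-aggregated counter: dimensions are fixed at instrumentation time.
// Once only {service, status} is recorded, you can never break the
// counts down by user agent, API version, or region after the fact.
const counters = new Map<string, number>();

function recordRequestMetric(service: string, status: number): void {
  const key = `${service}|${status}`; // dimensions chosen up front
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Wide event: every dimension is preserved and queryable later.
type WideEvent = Record<string, string | number>;
const events: WideEvent[] = [];

function recordRequestEvent(fields: WideEvent): void {
  events.push(fields);
}

recordRequestMetric('checkout', 500);
recordRequestEvent({
  service: 'checkout',
  status: 500,
  user_agent: 'mobile-app/3.2',
  api_version: 'v2',
  region: 'eu-west-1',
});

// The metric can answer "how many 500s did checkout serve?" but not
// "which user agents saw them?" — the event can answer both.
```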

Logs: The Context

Logs provide detailed context about individual events. They are essential for understanding what happened in a specific request or transaction.

{
  "timestamp": "2026-01-15T14:23:19Z",
  "level": "error",
  "service": "checkout-service",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user_42",
  "message": "Payment processing failed",
  "error": "timeout after 30s",
  "payment_provider": "stripe",
  "amount_cents": 4999,
  "currency": "USD",
  "retry_count": 3
}

Where logs fall short: At scale, searching through logs is slow and expensive. Log storage costs grow linearly with traffic. And unstructured logs — the kind most applications produce — are nearly useless for debugging complex issues.

Traces: The Journey

Distributed traces show you the path a request takes through your system, including timing for each hop:

[api-gateway] 2ms
  ├─[auth-service] 15ms
  └─[checkout-service] 450ms
      ├─[inventory-service] 12ms
      └─[payment-service] 420ms  ← bottleneck
          └─[stripe-api] 418ms  ← root cause

This immediately tells you that the checkout latency is caused by slow Stripe API calls, not your own services. Without traces, you might spend hours looking at the wrong service's metrics and logs.

Where traces fall short: Full-fidelity tracing at scale is expensive. Most organizations sample traces, which means you might not capture the exact request that caused the problem. And traces alone do not tell you whether the behavior is normal or anomalous.

Beyond the Three Pillars

The real power of observability comes from correlating across all three signals — and adding additional context:

  • Profiles: CPU and memory profiles that show where your code spends time
  • Events: Deployments, config changes, feature flag toggles
  • Real user monitoring: What are actual users experiencing?

The question is not "do I have metrics, logs, and traces?" but "can I pivot between these signals seamlessly to answer arbitrary questions?"

OpenTelemetry: The Instrumentation Standard

If you are starting an observability initiative today, OpenTelemetry (OTel) is the answer to "how do I instrument my code?"

Why OpenTelemetry Matters

Before OTel, every observability vendor had their own instrumentation SDK. Switching from Datadog to Honeycomb meant re-instrumenting your entire codebase. This vendor lock-in was expensive and created real barriers to adopting better tooling.

OTel provides a single, vendor-neutral standard for generating telemetry data. Instrument once, send to any backend.

// OpenTelemetry instrumentation in Node.js
import { trace, metrics, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service');
const meter = metrics.getMeter('checkout-service');

const checkoutCounter = meter.createCounter('checkout.attempts', {
  description: 'Number of checkout attempts',
});

const checkoutDuration = meter.createHistogram('checkout.duration', {
  description: 'Checkout processing duration in ms',
  unit: 'ms',
});

async function processCheckout(order: Order): Promise<CheckoutResult> {
  return tracer.startActiveSpan('processCheckout', async (span) => {
    const startTime = Date.now();

    span.setAttributes({
      'checkout.order_id': order.id,
      'checkout.item_count': order.items.length,
      'checkout.total_cents': order.totalCents,
    });

    checkoutCounter.add(1, { 'checkout.payment_method': order.paymentMethod });

    try {
      const result = await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // In TypeScript, a caught value is `unknown` — narrow it before use
      const err = error instanceof Error ? error : new Error(String(error));
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw error;
    }
    } finally {
      checkoutDuration.record(Date.now() - startTime, {
        'checkout.payment_method': order.paymentMethod,
      });
      span.end();
    }
  });
}

Auto-Instrumentation: Start Without Code Changes

OTel provides auto-instrumentation for most languages that captures HTTP requests, database queries, and framework-specific spans without modifying application code:

# Node.js auto-instrumentation
npm install @opentelemetry/auto-instrumentations-node

# Python auto-instrumentation
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

Auto-instrumentation gets you 80% of the value with near-zero effort. Manual instrumentation adds the remaining 20% — business-specific context like order values, user tiers, and feature flags.

The Collector: Your Telemetry Pipeline

The OpenTelemetry Collector is a vendor-neutral telemetry pipeline that receives, processes, and exports data:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  # Add environment context to all telemetry
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: team
        value: platform
        action: upsert

  # Tail-based sampling for traces
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/primary:
    endpoint: "tempo.monitoring:4317"
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, tail_sampling]
      exporters: [otlp/primary]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]

The collector pattern is powerful because it decouples your applications from your observability backend. Switch vendors by changing the exporter configuration, not your application code.

Structured Logging: Logs That Actually Help

Most application logs look like this:

2026-01-15 14:23:19 ERROR Failed to process payment for order 12345

This is human-readable but machine-hostile. When you have thousands of these per second, grep is not a debugging strategy.

Structured Logging Done Right

Every log line should be a structured event with consistent fields:

{
  "timestamp": "2026-01-15T14:23:19.456Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "checkout-service",
  "version": "2.4.1",
  "environment": "production",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user_42",
  "order_id": "order_12345",
  "payment_method": "credit_card",
  "error_type": "TimeoutError",
  "error_message": "Connection to payment provider timed out after 30000ms",
  "retry_count": 3,
  "duration_ms": 30045
}

Now you can query: "Show me all payment failures for credit card transactions where retry count exceeded 2 in the last hour." That is observability — asking arbitrary questions your dashboards were not designed to answer.
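The exact query syntax depends on your log backend, but the shape of the question is simple enough to sketch as a filter over structured events (TypeScript, field names matching the example above, all hypothetical):

```typescript
interface LogEvent {
  timestamp: string;
  level: string;
  payment_method?: string;
  retry_count?: number;
  [key: string]: unknown;
}

// "All payment failures for credit card transactions where retry count
// exceeded 2 in the last hour," expressed as a filter over structured events.
function recentCardFailures(events: LogEvent[], nowMs: number): LogEvent[] {
  const oneHourMs = 60 * 60 * 1000;
  return events.filter(
    (e) =>
      e.level === 'error' &&
      e.payment_method === 'credit_card' &&
      (e.retry_count ?? 0) > 2 &&
      nowMs - Date.parse(e.timestamp) <= oneHourMs
  );
}
```

The point is not that you would run this in application code — your log store runs the equivalent — but that every clause maps to a field that only exists because the log was structured.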

Correlation IDs: Connecting the Dots

The most important field in any log line is the trace ID. This single string connects a log entry to a distributed trace, which connects to metrics, which connects to the specific deployment that introduced the issue.

// Middleware that propagates trace context to logs
import { context, trace } from '@opentelemetry/api';

function getTraceContext() {
  const span = trace.getActiveSpan();
  if (!span) return {};

  const spanContext = span.spanContext();
  return {
    trace_id: spanContext.traceId,
    span_id: spanContext.spanId,
  };
}

// Every log call automatically includes trace context
logger.error('Payment processing failed', {
  ...getTraceContext(),
  order_id: order.id,
  error: err.message,
});

SLOs: The Bridge Between Observability and Business

Service Level Objectives (SLOs) transform observability data into business-relevant information. Instead of "the p99 latency is 450ms," you get "we have consumed 40% of our error budget this month and are on track to breach our SLO."

Defining Meaningful SLIs

Service Level Indicators (SLIs) should reflect what users actually care about:

Availability SLI: The proportion of requests that succeed.

sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))

Latency SLI: The proportion of requests that complete within an acceptable threshold.

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) / sum(rate(http_request_duration_seconds_count[30d]))

Correctness SLI: The proportion of operations that produce correct results.

sum(rate(data_processing_results_total{result="correct"}[30d])) / sum(rate(data_processing_results_total[30d]))
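All three SLIs share the same shape: good events divided by total events. A minimal helper (illustrative, not from any particular SLO library) makes that explicit:

```typescript
// Generic SLI: ratio of good events to total events over a window.
// With zero traffic there is nothing to violate, so report 1.0.
function sli(goodEvents: number, totalEvents: number): number {
  if (totalEvents === 0) return 1;
  return goodEvents / totalEvents;
}

// 99,950 successful requests out of 100,000 → 0.9995 availability
const availability = sli(99_950, 100_000);
```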

Error Budgets: Making Trade-offs Explicit

If your SLO is 99.9% availability, your error budget is 0.1% — roughly 43 minutes of downtime per month. This budget is shared between:

  • Planned maintenance
  • Deployment rollouts
  • Unplanned incidents
  • Performance degradation

When error budget is healthy, ship features aggressively. When error budget is depleted, prioritize reliability work. This creates a natural, data-driven tension between velocity and stability.
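The arithmetic behind the budget is worth making explicit. A sketch, assuming a 30-day window (real SLO tooling computes this for you):

```typescript
// Error budget for a 99.9% availability SLO over a 30-day window.
const slo = 0.999;
const windowMinutes = 30 * 24 * 60;              // 43,200 minutes
const budgetMinutes = windowMinutes * (1 - slo); // ~43.2 minutes

// Burn rate: observed error ratio relative to the budgeted ratio.
// A burn rate of 1.0 exhausts the budget exactly at the window's end.
function burnRate(observedErrorRatio: number, objective: number): number {
  return observedErrorRatio / (1 - objective);
}

// A sustained 0.4% error ratio burns a 0.1% budget 4x too fast,
// exhausting it in roughly a week instead of a month.
const rate = burnRate(0.004, slo);
```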

Common Observability Anti-Patterns

Dashboard Overload

Having 200 dashboards means nobody looks at any of them. Curate a small number of service-level dashboards that answer: "Is this service healthy?" Everything else should be discoverable through ad-hoc queries.

Alert on Everything

If your team gets more than 5 actionable alerts per on-call shift, your signal-to-noise ratio is destroying effectiveness. Alert on symptoms (SLO burn rate), not causes (CPU usage).
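"Alert on symptoms" in practice usually means multi-window burn-rate alerts, a pattern popularized by the Google SRE workbook. A sketch of the paging condition (the 14.4x threshold is the commonly cited value for a 30-day window, not a prescription):

```typescript
// Page only when BOTH a long and a short window burn fast: the long
// window proves the problem is sustained, the short window proves it
// is still happening (so you are not paged for an issue that already ended).
function shouldPage(
  burnRate1h: number, // burn rate measured over the last hour
  burnRate5m: number  // burn rate measured over the last 5 minutes
): boolean {
  // A 14.4x burn rate consumes 2% of a 30-day error budget in one hour.
  const threshold = 14.4;
  return burnRate1h > threshold && burnRate5m > threshold;
}
```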

Observability as Afterthought

Bolting observability onto an existing system is 10x harder than building it in from the start. If you are designing a new service, instrumentation should be part of the design, not a post-launch task.

Sampling Too Aggressively

Head-based sampling (deciding at the start of a request whether to trace it) means you will miss interesting traces. Tail-based sampling (deciding after the request completes based on its characteristics) captures errors and slow requests reliably.
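The tail-sampling policies in the collector config above boil down to a simple decision once the trace is complete. A sketch of that logic (illustrative only; in practice the collector's tail_sampling processor makes this decision):

```typescript
interface CompletedTrace {
  hasError: boolean;
  durationMs: number;
}

// Keep all errors, all slow requests, and a 10% random sample of the
// rest — mirroring the errors / slow-requests / probabilistic policies.
function keepTrace(
  t: CompletedTrace,
  rand: () => number = Math.random // injectable for testing
): boolean {
  if (t.hasError) return true;          // errors: always keep
  if (t.durationMs > 1000) return true; // slower than 1s: always keep
  return rand() < 0.1;                  // baseline: 10% probabilistic
}
```

The crucial property is that the decision happens after the request completes, when you know whether the trace was interesting. Head-based sampling has to guess up front.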

Ignoring Costs

Observability data is expensive to store and query. A naive "collect everything" approach can easily cost more than the infrastructure you are monitoring. Be intentional about what you collect, how long you retain it, and how you sample.

The Observability Maturity Model

Level 1: Basic Monitoring

  • Infrastructure metrics (CPU, memory, disk)
  • Application health checks
  • Basic alerting on thresholds
  • Centralized log aggregation

Level 2: Service-Level Observability

  • Request-level metrics (latency, error rate, throughput)
  • Structured logging with correlation IDs
  • Basic distributed tracing
  • SLIs defined for critical services

Level 3: Full Observability

  • High-cardinality event data for ad-hoc analysis
  • Trace-driven debugging workflows
  • SLOs with error budgets driving prioritization
  • Continuous profiling for performance optimization
  • Automated anomaly detection

Level 4: Proactive Observability

  • ML-driven anomaly detection and root cause analysis
  • Automated remediation for known failure patterns
  • Chaos engineering informed by observability data
  • Business KPIs directly derived from telemetry

Most organizations are between Level 1 and Level 2. Getting to Level 3 is a realistic and high-impact goal. Level 4 is aspirational for most teams but represents where the industry is heading.

Getting Started

If you are starting from scratch or upgrading from basic monitoring, here is the pragmatic path:

  1. Instrument with OpenTelemetry. Deploy auto-instrumentation first, then add manual spans for business-critical paths. This gives you metrics and traces with minimal effort.

  2. Adopt structured logging. Convert your existing log statements to structured JSON with trace context. This is the highest-effort, highest-value change.

  3. Define SLOs for your top 3 services. Start with availability and latency SLIs. Use error budgets to drive engineering prioritization.

  4. Build one great service dashboard. Not twenty dashboards — one per critical service that answers "is it healthy?" with SLO burn rates, error rates, and latency distributions.

  5. Practice ad-hoc debugging. During your next incident, deliberately avoid your pre-built dashboards. Can you diagnose the issue using trace search and log queries alone? If not, you know where your observability gaps are.

The goal is not to collect more data. It is to be able to answer any question about your system's behavior, at any time, without deploying new code. That is observability.
