Observability Engineering: Logs, Metrics, and Traces

Observability is the degree to which you can understand the internal state of a system from its external outputs. A system with good observability is one where, when something goes wrong, you can figure out what happened and why — ideally without needing to deploy new code or reproduce the problem locally.

The three main building blocks of observability are logs, metrics, and traces. They serve different purposes and answer different questions.

Logs: what happened

Logs are timestamped records of discrete events. They are the most familiar observability signal: in the simplest case, print statements whose output goes to disk or to a log aggregator.

Good logs are:

Structured: JSON or another parseable format, not free-text strings. Structured logs are queryable — you can filter by user_id, error_code, or request_id without string parsing.

{
  "timestamp": "2025-01-15T10:23:45Z",
  "level": "error",
  "service": "payment-api",
  "request_id": "req_abc123",
  "user_id": "usr_xyz789",
  "event": "payment_failed",
  "reason": "insufficient_funds",
  "amount_cents": 4999
}

At appropriate levels: DEBUG for development details, INFO for normal operations, WARN for unexpected-but-handled situations, ERROR for failures that need attention.

Correlated with request IDs: Every request should have a unique ID that appears in every log line generated during that request. This lets you trace all logs for a single request through the system.
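
The three properties above can be sketched with Python's standard logging module. This is a minimal illustration, not a recommended production setup: the JsonFormatter class, the "fields" key passed through extra, the request_id_var context variable, and the handle_request function are all hypothetical names chosen for this example.

```python
import contextvars
import io
import json
import logging
import time
import uuid

# Holds the current request's ID; "-" means "outside any request".
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    converter = time.gmtime  # format timestamps in UTC so the "Z" suffix is accurate

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-api",  # hypothetical service name
            "request_id": request_id_var.get(),
            "event": record.getMessage(),
        }
        # Extra structured fields arrive via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("payment-api")
    logger.setLevel(logging.INFO)  # DEBUG lines are dropped at this level
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger, user_id, amount_cents):
    # One ID per request; every log line emitted below carries it.
    token = request_id_var.set("req_" + uuid.uuid4().hex[:8])
    try:
        logger.debug("loading_user", extra={"fields": {"user_id": user_id}})
        logger.info("payment_started", extra={"fields": {"user_id": user_id}})
        logger.error(
            "payment_failed",
            extra={"fields": {"user_id": user_id,
                              "reason": "insufficient_funds",
                              "amount_cents": amount_cents}},
        )
    finally:
        request_id_var.reset(token)


buffer = io.StringIO()
handle_request(build_logger(buffer), "usr_xyz789", 4999)
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because every line is a JSON object carrying the same request_id, finding all logs for one request becomes a filter on a field rather than a text search.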

Metrics: how the system is performing

Metrics are numerical measurements over time. Unlike logs (which record individual events), metrics aggregate and summarize behavior.

Key metric types:

Counters: Monotonically increasing counts. Total requests, total errors, total bytes processed. Useful for calculating rates (errors per minute).

Gauges: Current values that go up and down. CPU usage, active connections, queue length.

Histograms: Distributions of measurements. Latency histograms allow computing percentiles (p50, p95, p99) — more useful than averages, which hide tail latency.
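
To make the three types concrete, here is a small sketch in plain Python. The Counter and Gauge classes and the helper functions are hypothetical, written only for this example. The percentile helper keeps raw samples and uses the nearest-rank method for simplicity, whereas real metrics systems usually approximate percentiles from bucketed histograms.

```python
import statistics


class Counter:
    """Monotonically increasing count, e.g. total requests served."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """Current value that can go up or down, e.g. queue length."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


def rate_per_second(earlier_value, later_value, elapsed_seconds):
    """Turn two readings of a counter into a rate."""
    return (later_value - earlier_value) / elapsed_seconds


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


requests_total = Counter()
reading_at_start = requests_total.value
requests_total.inc(120)  # assume 120 requests arrive during the next 60 seconds
request_rate = rate_per_second(reading_at_start, requests_total.value, 60)

queue_length = Gauge()
queue_length.set(7)
queue_length.set(3)  # gauges may decrease

# 90 fast requests and 10 slow ones: the mean hides how bad the slow ones are.
latencies_ms = [20] * 90 + [2000] * 10
mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With this made-up data the mean is 218 ms, a figure no request actually experienced: the median request took 20 ms and the p99 request took 2000 ms.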

The most important metrics for most services:

Request rate: Requests per second, by endpoint and status code
Error rate: Percentage of requests that resulted in an error
Latency: p50, p95, p99 latency by endpoint
Saturation: How much capacity is being used (CPU, memory, queue depth)

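These four overlap with several named frameworks. Rate, errors, and duration (latency) make up the RED method for request-driven services. The USE method applies to resources and covers utilization, saturation, and errors. Taken together, the four listed here closely match the "four golden signals" from Google's SRE book: latency, traffic, errors, and saturation.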
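
As a rough sketch of how the four metrics listed above could be derived from raw request records, consider the following. The Request record, the summarize and saturation_pct functions, and the choice to count only 5xx responses as errors are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int
    latency_ms: int


def nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Per-endpoint request rate, error rate, and p95 latency for one time window."""
    by_endpoint = defaultdict(list)
    for request in requests:
        by_endpoint[request.endpoint].append(request)

    summary = {}
    for endpoint, reqs in by_endpoint.items():
        errors = sum(1 for r in reqs if r.status >= 500)  # assumption: 5xx means error
        summary[endpoint] = {
            "rate_per_s": len(reqs) / window_seconds,
            "error_pct": 100 * errors / len(reqs),
            "p95_ms": nearest_rank([r.latency_ms for r in reqs], 95),
        }
    return summary


def saturation_pct(used, capacity):
    """Share of a bounded resource in use, e.g. queue depth versus queue size."""
    return 100 * used / capacity


# Made-up traffic for one 60-second window.
window = (
    [Request("/pay", 200, 100)] * 27
    + [Request("/pay", 500, 900)] * 3
    + [Request("/health", 200, 5)] * 60
)
report = summarize(window, window_seconds=60)
queue_saturation = saturation_pct(used=80, capacity=100)
```

In this example /pay serves 0.5 requests per second with a 10% error rate and a p95 of 900 ms, while /health stays fast and error-free.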

Traces: how a request flowed

In a distributed system, a single user request may touch many services. A trace records the entire journey of that request — which services were called, in what order, with what duration.

A trace consists of spans: one span per operation. The root span represents the top-level request. Child spans represent downstream calls. Each span has:

A name
A start time and duration
Tags, also called attributes (metadata about the operation)
A trace ID and a parent span ID (linking it to the trace and to the span that called it)

Tracing makes it possible to answer questions like "why is this endpoint slow?" when the answer is "because service B, which it calls, has increased latency in its database queries."
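
The span structure and the "why is this endpoint slow?" question can be sketched in a few lines. The Span class, its field names, and the sample trace are hypothetical. The self_time helper assumes a span's direct children do not overlap in time, which does not hold for parallel calls.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    duration_ms: int
    tags: dict = field(default_factory=dict)


def self_time(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_operation(spans):
    """The span that accounts for the most time on its own."""
    return max(spans, key=lambda s: self_time(s, spans))


def render(spans, parent_id=None, depth=0):
    """Return the trace as indented lines, children ordered by start time."""
    lines = []
    children = sorted((s for s in spans if s.parent_id == parent_id),
                      key=lambda s: s.start_ms)
    for span in children:
        lines.append("  " * depth + f"{span.name} ({span.duration_ms} ms)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


# Made-up trace: the endpoint is slow because service B waits on its database.
trace = [
    Span("t1", "a", None, "GET /checkout", 0, 500),
    Span("t1", "b", "a", "service-a: authenticate", 5, 20),
    Span("t1", "c", "a", "service-b: get_order", 30, 450),
    Span("t1", "d", "c", "db: SELECT orders", 40, 420, {"db.system": "postgresql"}),
]
tree = render(trace)
culprit = slowest_operation(trace)
```

Here the root span takes 500 ms, but only 30 ms of that is its own work. The database span under service B accounts for 420 ms, which is the kind of answer a trace view surfaces at a glance.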

OpenTelemetry is a vendor-neutral standard for instrumenting applications to produce traces, metrics, and logs. SDKs exist for most major languages, and most observability platforms can ingest OpenTelemetry data.

Putting it together

Logs, metrics, and traces work best together:

Metrics tell you something is wrong (error rate spiked, latency increased)
Traces tell you where the problem is (which service or operation is slow)
Logs tell you what happened (the specific error message, the relevant context)

A common workflow when debugging:

Alert fires (metric threshold exceeded)
Look at the dashboard to confirm the scope (which endpoints? which regions?)
Sample traces to identify the slow or failing operation
Examine logs for that operation to understand the specific failure

Alerting

Observability without alerting means you only know about problems when users tell you. Alerts should be:

Actionable: Every alert should correspond to something a human can and should do. Alerts that fire without a clear response become noise that gets ignored.

Symptom-based, not cause-based: Alert on high error rates (what users experience) rather than specific error codes (which may have many causes). Investigate causes after the alert fires.

Set on appropriate percentiles: Alerting on average latency misses problems that affect a minority of users. Alert on p99 or p95 to catch tail latency issues.
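
A small sketch shows why the percentile matters. The evaluate_alerts function, its thresholds, and the treatment of 5xx responses as errors are assumptions for illustration; a real alerting rule would normally be written in the monitoring system's own rule language.

```python
import statistics


def nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def evaluate_alerts(requests, max_error_pct, max_p99_ms):
    """requests: list of (status_code, latency_ms) tuples from one evaluation window."""
    alerts = []

    # Symptom 1: users are seeing errors (assumption: 5xx means error).
    errors = sum(1 for status, _ in requests if status >= 500)
    error_pct = 100 * errors / len(requests)
    if error_pct > max_error_pct:
        alerts.append(f"error rate {error_pct:.1f}% exceeds {max_error_pct}%")

    # Symptom 2: the slowest 1% of requests are too slow.
    p99 = nearest_rank([latency for _, latency in requests], 99)
    if p99 > max_p99_ms:
        alerts.append(f"p99 latency {p99} ms exceeds {max_p99_ms} ms")

    return alerts


# Made-up window: 2% of requests take 3 seconds, the rest take 50 ms.
window = [(200, 50)] * 196 + [(200, 3000)] * 4
mean_latency = statistics.mean([latency for _, latency in window])
fired = evaluate_alerts(window, max_error_pct=1, max_p99_ms=500)
```

The mean latency in this window is 109 ms, comfortably under a 500 ms threshold, so an alert on the average would stay silent. The p99 check fires because 2% of requests took 3 seconds.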

The cost of observability

Observability infrastructure costs money: storage for logs and traces, ingestion costs, query costs. At high volume, these can be significant.

Practical cost management:

Sample traces (collect 1% or 10% of traces rather than 100% for normal traffic; collect 100% for errors)
Set log retention policies — not everything needs to be kept for years
Aggregate metrics before storing them (for example, as per-minute rollups) instead of keeping every raw data point
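
The trace sampling rule from the first item can be sketched as follows. The keep_trace function and its parameters are hypothetical. Note that keeping every error trace requires knowing the outcome, so this decision has to be made after the request completes rather than when it starts.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, sample_rate: float) -> bool:
    """Keep every error trace and a fixed fraction of all other traces."""
    if is_error:
        return True
    # Hash the trace ID so that every component reaches the same decision
    # for a given trace, without coordinating with the others.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in 0 .. 2**64 - 1
    return bucket < int(sample_rate * 2**64)


# Made-up trace IDs for demonstration.
trace_ids = [f"trace-{n}" for n in range(10_000)]
kept_normal = sum(keep_trace(t, is_error=False, sample_rate=0.10) for t in trace_ids)
kept_errors = sum(keep_trace(t, is_error=True, sample_rate=0.10) for t in trace_ids)
```

Hashing the ID instead of drawing a random number keeps the decision deterministic, which avoids storing half of a trace because two services decided differently.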

Summary

Observability requires three complementary signals: logs (individual events), metrics (aggregated numerical measurements), and traces (end-to-end request flows). Structured, correlated logs enable efficient debugging. Key metrics — rate, errors, latency, saturation — indicate system health. Distributed traces locate problems in complex systems. Good alerting is symptom-based and actionable. OpenTelemetry provides a vendor-neutral standard for instrumentation.
