observability
Decide what to instrument with logs, metrics, and traces so a system is debuggable in production. Use when the user asks about observability, monitoring, logging, metrics, tracing, alerts, or how to make a service debuggable.
You're observable if a stranger on-call at 3 AM can answer "what's broken and why" from the data alone, without pinging the author. That's the whole goal.
When to use
- "Add logging / metrics / tracing"
- "How should I instrument X?"
- "Make this service observable"
- "Set up alerts"
- "Debug production without a debugger"
The three pillars (and what they're each good at)
| Signal | Answers | Cost |
|---|---|---|
| Logs | "What happened?" — discrete events, full detail | High volume; cheap per event, expensive at scale |
| Metrics | "How much / how often?" — aggregated, low-cardinality, time-series | Cheap, fast to query, lossy on details |
| Traces | "Where did time go?" — request flow across services | Sampled; great for latency and dependency analysis |
You need all three. Each compensates for the others' weaknesses.
Logs
What to log
- Boundary events: request received, request completed (status, latency)
- Business events: order placed, user registered, payment captured
- Errors with full context (stack, inputs, user/account ID, request ID)
- State transitions that matter
- External calls: which service, latency, outcome
- Background job: started, completed, retried, failed
What NOT to log
- Secrets, tokens, passwords, full credit card numbers
- Personally identifiable info beyond what's necessary (and never in plaintext if you can avoid it)
- Inside hot loops without throttling
- "Success" pings every second from a healthy heartbeat — too much noise
- Anything you wouldn't want screen-shared in a debugging session
Structure them
Structured logs (JSON) > text logs. They're greppable and queryable.
{
"ts": "2026-05-22T14:30:00.123Z",
"level": "info",
"msg": "order.placed",
"request_id": "req_01HX...",
"user_id": "usr_42",
"order_id": "ord_7",
"amount_cents": 1999,
"currency": "USD",
"duration_ms": 42
}
Rules:
- One event per line, valid JSON
- Stable field names across the codebase (
request_id, notreqIdhere andrequestIdthere) tsin ISO 8601 UTClevelfrom a fixed set:debug,info,warn,errormsgis a stable identifier, not a sentence —order.placednot"Order placed for user 42 at $19.99". Put dynamic values in their own fields.
Levels
| Level | Use for | Production volume |
|---|---|---|
debug |
Detailed flow info | Off in prod (or sampled) |
info |
Normal events worth recording | High but bounded |
warn |
Unexpected but handled | Low |
error |
Failures requiring attention | Very low, every one investigated |
If error is firing constantly, it's not error — it's noise. Either fix the cause or change the level.
Metrics
The four golden signals (for any user-facing service)
- Latency — how long requests take (p50, p95, p99, never average)
- Traffic — how many requests (RPS)
- Errors — error rate (% or absolute)
- Saturation — how close to capacity (CPU, memory, queue depth, connection pool usage)
Instrument these four for every service. Almost everything else follows.
Naming
- Snake_case, dot-separated namespace:
http.request.duration_ms,payment.captured.count - Include the unit in the name:
_ms,_bytes,_count,_ratio,_total - Counter names end in
_totalif cumulative (Prometheus convention) - Histograms for distributions (latency), counters for events, gauges for current state
Labels / tags / dimensions
Add labels for dimensions you'll filter by: method, route, status_code, region.
Cardinality is the silent killer. Each unique combination of label values creates a separate time series. Don't put high-cardinality values in labels:
- ❌
user_id— millions of unique values, kills your metrics backend - ❌
request_id,order_id, exact URLs with path params - ✅
user_tier(free / pro / enterprise),route(parameterized:/users/:id)
If you need per-user investigation, that's what logs and traces are for. Not metrics.
Histograms, not averages
Average latency: 150ms
p50: 80ms, p95: 400ms, p99: 2000ms, max: 30s
The first number is comforting. The other four describe what users actually experience. Always histogram for latency.
Traces
When to add tracing
- Multi-service request flows (frontend → API → downstream → DB → cache)
- Background jobs with multiple stages
- Anywhere "where did the time go?" is asked frequently
- Async workflows: connect the trace across queues
What a trace tells you
[─────────── HTTP GET /api/orders/42 (320ms) ──────────────]
[── auth (5ms) ──]
[─────── DB: SELECT orders (180ms) ──────]
[── DB: SELECT items (40ms) ──]
[── cache get (1ms) ─]
[─── external: shipping API (80ms) ───]
You see where time goes and what's in the critical path. No more "is the DB slow or the external API slow?" guesswork.
Sampling
Tracing every request is expensive and rarely useful. Sample:
- Head-based: decide at the start (simple, may miss interesting traces)
- Tail-based: collect all, sample based on outcome (slow / errored / unusual) — keeps the interesting ones
- Always sample errors at 100%
Correlation IDs
A request_id (or trace_id) on every log line + propagated to downstream services + included in error responses lets you stitch everything together. This is the single highest-leverage observability investment.
Frontend log: request_id=req_01HX... user clicked /checkout
Gateway log: request_id=req_01HX... routed to api-service
API log: request_id=req_01HX... ERROR validation failed
Customer email: support reference: req_01HX...
One ID, the whole story.
Alerts — make them actionable
An alert should answer what's wrong + what to do.
Good alerts
- Fire on user-visible symptoms, not internal causes ("error rate > 1% for 5 min" not "CPU > 80%")
- Include a runbook link
- Have an owner
- Are tested (synthetic failure to verify they fire)
Alert fatigue is the failure mode
If on-call ignores or auto-silences alerts, your alerts are broken. Symptoms:
- Alerts that fire and recover before anyone sees them
- "It always fires, just ignore it"
- Multiple alerts for the same underlying cause
- Pages outside business hours for non-pageable severity
Fix the alerts, not the humans.
Alert template
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
for: 5m
labels:
severity: page
service: api
owner: payments-team
annotations:
summary: "API error rate above 1% for 5 minutes"
description: "Current rate: {{ $value | humanizePercentage }}"
runbook: "https://runbooks/api/high-error-rate"
dashboard: "https://grafana/d/api-overview"
Dashboards — one per service, no more
Every service should have one canonical dashboard answering:
- Is it up? (traffic + error rate)
- Is it fast? (latency percentiles)
- Is it healthy? (saturation)
- What's it doing? (top routes, top errors)
More dashboards ≠ better observability. They scatter attention. One good dashboard, kept up to date, beats twelve neglected ones.
What to instrument first (priority order)
If you have nothing yet:
- Request log with
request_id, status, latency, route — on every endpoint - Four golden signals as metrics
- Error tracker (Sentry / Honeybadger / Rollbar) — uncaught errors with stack + context
- One canonical dashboard + one runbook
- Alert on error rate and alert on p95 latency
- Trace the top 1–2 user journeys end-to-end
- Saturation alerts: disk, memory, queue depth
That gets you 80% of the value. Add more only when a real debugging session was hampered by missing data.
Output format
When reviewing instrumentation:
## Observability review: <service / scope>
**Current signals:** logs / metrics / traces / alerts
### What's instrumented
- ✅ <thing> — <how>
### Gaps
- 🔴 <missing critical instrumentation> — <user impact when debugging>
- 🟠 <missing significant signal>
- 🟡 <nice to have>
### High-cardinality risks (metrics)
- `<metric>{<label>}` — <why this is dangerous>
### Alert quality
- ✅ <good alert>
- 🔴 <noisy / unactionable alert>
### Recommended additions (prioritized)
1. …
2. …
Anti-patterns
- ❌ Logging at
errorlevel for handled exceptions — false alarms - ❌
console.log("here")left in production code - ❌ Unstructured logs ("User 42 placed order 7 for $19.99 on 2026-05-22") — hard to query
- ❌ Average latency as your latency metric
- ❌
user_idin metric labels - ❌ Logging passwords, tokens, credit cards, full PII
- ❌ Alerts on causes (CPU high) instead of symptoms (users seeing 500s)
- ❌ Twelve dashboards, none of them maintained
- ❌ Adding logs but not searching them — if you've never queried it, you don't need it
- ❌ Tracing on but never sampled / never looked at
References
- Google SRE Book — Monitoring Distributed Systems — the four golden signals
- USE Method (Brendan Gregg) — Utilization, Saturation, Errors for resources
- RED Method (Tom Wilkie) — Rate, Errors, Duration for services
- OpenTelemetry — vendor-neutral instrumentation API
- Charity Majors on observability — pointed essays on doing it well