---
title: 'Observability Pattern'
description: 'How to know what the system is doing in production, beyond ''tests passed''.'
---

# Observability Pattern

How to know what the system is doing in production, beyond "tests passed".

## TL;DR (human)

Three signals: **metrics** (counters/gauges/histograms, low cardinality), **logs** (events, structured, queryable), **traces** (request spans across services). Define SLOs (Service-Level Objectives) that capture user-perceived correctness; alert on SLO burn rate, not on noisy thresholds. Per-tenant attribution is mandatory in multi-tenant systems.

## For agents

### The three signals

| Signal | Question it answers | Storage shape | Volume |
|---|---|---|---|
| **Metrics** | "How is the system trending?" | Time series, aggregated | Low (cardinality controlled) |
| **Logs** | "What exactly happened on this request?" | Structured events | High (sampled / retained per class) |
| **Traces** | "Where did the time / failure go in this request?" | Spans + dependencies | Medium (often sampled) |

You need all three. Each answers questions the others cannot.

### Metrics — what to collect

**Per service, default set**:

- **RED**: Rate (requests/s), Errors (errors/s), Duration (latency histogram).
- **USE**: Utilization (CPU/mem/IO %), Saturation (queue depth), Errors (system-level).

**Per business event**: counter per meaningful product event (`user.invited`, `flow.executed`, `payment.charged`). These drive product metrics + sanity dashboards.

**Cardinality discipline**: metric labels should be low-cardinality. `service` + `method` + `status` is fine. `user_id` as a label is fatal — every user explodes the metric count. Use logs / traces for high-cardinality dimensions.

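
The cardinality rule above can be made concrete with a small sketch. It assumes a hypothetical in-memory `Counter` class (not the API of any real metrics client) that only accepts an allow-listed set of label keys and reports how many distinct series it holds. The metric name `http_requests_total` and the label values are illustrative.

```ts
// Hypothetical in-memory counter, used only to show how labels multiply series.
type Labels = Record<string, string>;

class Counter {
  private readonly name: string;
  private readonly allowedLabels: readonly string[];
  // One entry per distinct combination of label values (one "time series").
  private readonly series = new Map<string, number>();

  constructor(name: string, allowedLabels: readonly string[]) {
    this.name = name;
    this.allowedLabels = allowedLabels;
  }

  inc(labels: Labels, value: number = 1): void {
    // Reject any label outside the allow-list, such as user_id.
    for (const key of Object.keys(labels)) {
      if (!this.allowedLabels.includes(key)) {
        throw new Error(`label "${key}" is not allowed on metric ${this.name}`);
      }
    }
    // Build a stable key from the allowed labels, in a fixed order.
    const seriesKey = this.allowedLabels
      .map((key) => `${key}=${labels[key] ?? ""}`)
      .join(",");
    this.series.set(seriesKey, (this.series.get(seriesKey) ?? 0) + value);
  }

  seriesCount(): number {
    return this.series.size;
  }
}

const requestsTotal = new Counter("http_requests_total", ["service", "method", "status"]);
requestsTotal.inc({ service: "users", method: "list", status: "200" });
requestsTotal.inc({ service: "users", method: "list", status: "200" });
requestsTotal.inc({ service: "users", method: "list", status: "500" });
```

With these three labels the series count grows only with the number of services, methods, and status codes. Had `user_id` been allowed, every new user would add another series, which is why the sketch rejects it at the call site and leaves that dimension to logs and traces.
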
### Logs — what to log

Structured JSON, not free-form strings:

```ts
logger.info("user.invited", {
  workspaceId, // multi-tenant attribution
  inviterId,
  inviteeEmail: "", // PII redacted
  requestId,
  durationMs: 47,
});
```

**Required fields on every log**:

- `level` (info / warn / error / debug).
- `tag` (the source component).
- `timestamp` (ISO-8601 UTC).
- `requestId` (correlates with traces).
- `workspaceId` / `tenantId` (multi-tenant attribution).

**What to log** (good signal):

- Boundary crossings (request enters / exits).
- Business events (the named events above).
- Recoverable errors (with `cause`).
- State transitions.

**What NOT to log**:

- Routine per-row reads.
- Inside hot loops.
- Anything with raw PII / secrets — redact at the logger.

### Traces — what to trace

A trace is a tree of spans for one request, across services / async boundaries.

**Span at**:

- Every boundary (HTTP / RPC / IPC).
- Every external call (DB query, third-party API, message-bus publish).
- Significant in-process operations (a long parse, an expensive computation).

**Span attributes**:

- Service + method name.
- Status (ok / error).
- Duration.
- Request-id propagated across boundaries.
- Multi-tenant attribution.

Sampling: 100% of error traces; per-tenant sampling of successful traces (e.g. 1%). Critical paths (payment, security) sampled higher.

### Correlation

The unifying field is `requestId`. Every signal carries it:

- Logs include `requestId`.
- Traces use `requestId` as the trace id.
- Metric exemplars (when supported) link to a representative trace via requestId.

From a single user-reported issue: read the logs by requestId → jump to the trace → see the metric at that time. Five minutes of triage, not an hour.

### SLOs and SLIs

**SLI (Service-Level Indicator)**: a measurable thing. "p95 latency of `users.list`". "Error rate of `payments.charge`".

**SLO (Service-Level Objective)**: a target. "p95 < 200ms over rolling 30 days". "Error rate < 0.1% over rolling 7 days".

**SLA (Service-Level Agreement)**: the contractual version of an SLO with consequences. Usually weaker than internal SLOs (you pad internally).

Pick SLIs per **user journey**, not per service. The user does not care that `users-service` is fast if a slow `auth-service` is blocking their login.

Example SLO catalogue:

| Journey | SLI | SLO |
|---|---|---|
| Login | p95 end-to-end latency | < 1s over 30 days |
| Login | success rate | > 99.9% over 7 days |
| Run flow | p95 dispatch latency | < 500ms over 30 days |
| Run flow | success rate (excluding user errors) | > 99.5% over 7 days |
| Page load (dashboard) | p95 TTFB | < 800ms over 30 days |

### Error budget

For each SLO, the **error budget** is what is allowed to fail. A 99.9% SLO over 30 days leaves 0.1% of 43,200 minutes, about 43 minutes of badness allowed.

When the error budget burn rate is high (burning a month's budget in a day), alert. When the budget is exhausted, freeze risky changes (feature rollouts, infra migrations) until budget recovers.

Error budget is the framework for negotiating reliability vs feature velocity:

- Budget intact → ship features fast.
- Budget low → focus on reliability.

### Alerting

Alert on **user-impacting** failures, not on every anomaly:

- High burn rate on an SLO (you'll exhaust the budget within hours).
- Cross-cutting saturation (CPU 95% on every node).
- Specific catastrophic events (audit ledger verification failed, vault unreachable, region down).

Anti-alerts (avoid):

- "Error count > 5 in 1 minute" — noise, churn.
- Every individual ERROR log line.
- Every transient latency spike.

Alerts should wake someone. If they would not be actionable at 3 AM, they should not page.

### Dashboards

Per service:

- RED metrics.
- USE metrics.
- Top business events (counts per minute).
- SLO burn-rate.

Per team:

- The SLOs they own.
- Recent incident burndown.
- Top error sources (by count, by user impact).

Per tenant (for support):

- Their request rate, error rate, p95 latency.
- Their quota usage.

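
The error-budget arithmetic and the burn-rate signal that the alerting and dashboard sections rely on can be sketched in a few lines. This is a minimal illustration, assuming a time-based SLO and a single measurement window. The function names and the `pageThreshold` parameter are hypothetical, and the threshold value is a tuning choice this document does not prescribe.

```ts
// Allowed "bad" minutes for a time-based SLO over a rolling window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Burn rate: observed error ratio divided by the ratio the SLO tolerates.
// 1 means the budget lasts exactly the window; 30 on a 30-day window means
// the whole budget is gone in one day.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

// Hours until the whole budget is consumed if the current burn rate holds.
function hoursToExhaustion(rate: number, windowDays: number): number {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

// pageThreshold is a hypothetical tuning knob, not a value from this document.
function shouldPage(rate: number, pageThreshold: number): boolean {
  return rate >= pageThreshold;
}

const budget = errorBudgetMinutes(0.999, 30);        // about 43.2 minutes
const currentBurn = burnRate(0.03, 0.999);           // about 30: a month's budget in a day
const hoursLeft = hoursToExhaustion(currentBurn, 30); // about 24 hours
```

Paging on `shouldPage(currentBurn, threshold)` ties the alert to how fast the budget is disappearing, so a brief spike that barely dents the budget does not wake anyone while a sustained 3% error ratio against a 99.9% target does.
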
### Cost

Observability is expensive. Discipline:

- **Metrics**: low cardinality; aggregate at source where possible.
- **Logs**: structured + sampled; retention tiered (full for 7 days, sampled for 90, cold for 1 year).
- **Traces**: tail-sampled (keep error traces in full; sample success).

Forecast your observability bill alongside your infra bill. Surprise observability costs are common.

### Multi-tenant attribution (mandatory)

Every signal in a multi-tenant system carries the tenant id. Support runs queries scoped to one tenant. Cost attribution per tenant flows from this.

If you cannot attribute a metric / log / trace to a tenant, you cannot:

- Help that specific customer.
- Bill that customer (cost-based pricing).
- Detect noisy-neighbor effects.
- Honour DSAR (delete that tenant's logs).

### Common failure modes

- **High-cardinality metric label.** Time-series DB blows up. → User id in logs/traces, not metric labels.
- **Free-form log messages.** Cannot query. → Structured logs.
- **Alerts on every error.** Pager fatigue; real alert ignored. → Alert on burn rate / impact.
- **No trace correlation.** Request fails; logs are scattered; no causality. → `requestId` everywhere.
- **No SLOs.** "Is the system OK?" answered by feel. → Define + dashboard + alert.
- **Tenant attribution missing.** Cannot help a specific customer. → Mandatory tag on every signal.
- **Observability stack costs as much as the product.** Sampling + retention not tuned. → Tiered retention; sampling.

### Tooling stack (typical)

| Concern | Tool |
|---|---|
| Metrics | Prometheus, Datadog, CloudWatch, Grafana Cloud |
| Logs | Loki, Elastic, Datadog, CloudWatch Logs |
| Traces | OpenTelemetry + Jaeger / Tempo / Datadog APM |
| Dashboards | Grafana, Datadog |
| Alerting | Alertmanager, PagerDuty, Opsgenie |
| Errors / exceptions | Sentry, Rollbar |
| RUM (real-user monitoring) | Datadog RUM, Sentry, New Relic |

OpenTelemetry as the **instrumentation standard** lets you swap backends.

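
Whichever tracing backend is chosen, the tail-sampling rule from the trace and cost sections (keep every error trace, sample successes per tenant, sample critical paths higher) comes down to a decision made once a trace has finished. The sketch below is one way to express that decision, not the configuration format of any particular collector. The `FinishedTrace` and `SamplingPolicy` shapes, the field names, and the hash-based selection are all assumptions for illustration.

```ts
// Hypothetical summary of a completed trace, available only after it ends.
interface FinishedTrace {
  traceId: string;
  tenantId: string;
  hasError: boolean;
  criticalPath: boolean; // e.g. payment or security journeys
}

// Hypothetical policy shape; rates are fractions in [0, 1].
interface SamplingPolicy {
  defaultSuccessRate: number;              // e.g. 0.01 for 1%
  criticalPathRate: number;                // assumed higher rate for critical paths
  tenantSuccessRates: Map<string, number>; // per-tenant overrides
}

// FNV-1a-style hash over UTF-16 code units, mapped to [0, 1).
// Deterministic, so every collector instance makes the same decision for a trace.
function hashToUnit(id: string): number {
  let h = 2166136261;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) / 4294967296;
}

function keepTrace(trace: FinishedTrace, policy: SamplingPolicy): boolean {
  // Error traces are always kept in full.
  if (trace.hasError) return true;
  const tenantRate =
    policy.tenantSuccessRates.get(trace.tenantId) ?? policy.defaultSuccessRate;
  // Critical paths never sample lower than the critical-path rate.
  const rate = trace.criticalPath
    ? Math.max(tenantRate, policy.criticalPathRate)
    : tenantRate;
  return hashToUnit(trace.traceId) < rate;
}
```

Hashing the trace id instead of calling a random number generator keeps the decision reproducible, which helps when several collector instances see spans from the same trace.
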
### See also

- [`universal.md`](./universal.md) — gates produce actionable signals; observability extends the principle to runtime.
- [`performance-budgets-pattern.md`](./performance-budgets-pattern.md) — perf budgets are derived from observability data.
- [`chaos-engineering-pattern.md`](./chaos-engineering-pattern.md) — observability is a precondition.
- [`../security/audit-ledger-pattern.md`](../security/audit-ledger-pattern.md) — distinct from observability (compliance vs operations).