How to turn observability data into a healthy alerting setup — pages that matter, runbooks that work, alert hygiene that prevents pager fatigue.

Alerting + Runbooks Pattern

How to turn observability data into a healthy alerting setup — pages that matter, runbooks that work, alert hygiene that prevents pager fatigue.

TL;DR (human)

Alert on user-impacting failures, not on every anomaly. Each alert has a runbook with five sections (symptoms / verify / mitigate / diagnose / resolve). Tune alerts on a cadence: aim for > 80% page-worthy ratio. Pager fatigue is the silent killer of on-call quality.

For agents

What deserves an alert

A signal becomes an alert when all of these hold:

User-impacting (or will be soon).
Actionable by the on-call responder.
Recoverable within minutes-to-hours by a single responder.
Confirmable (the responder can verify the alert is real).

If any condition fails, it's not an alert:

Not user-impacting → dashboard / weekly review.
Not actionable → diagnostic; not paging-worthy.
Recovery requires team effort → SEV escalation; broader response, not single page.
Cannot confirm → false-positive risk; tune the signal first.

Alert types

Symptom alerts (preferred): "the user-visible thing is broken".

p95 latency for journey X > Y for N minutes.
Error rate for endpoint Z > 1%.
Synthetic check for golden path failing.

Cause alerts: "an underlying thing is failing".

DB connection pool saturation.
Disk > 90% full.
Queue depth growing > rate.

Symptom alerts catch what users see. Cause alerts catch what produces those symptoms.

Both are needed. Symptom alerts as the primary; cause alerts as predictive (catch before the symptom).

Burn-rate alerting (SLO-based)

Modern best practice: alert on SLO error budget burn rate, not on raw thresholds.

SLO: 99.9% over 30 days
Error budget: 0.1% = 43.2 minutes of badness

Burn rate = current error rate / acceptable error rate.

Burn rate	Time to exhaust 30-day budget	Page?
1×	30 days	No
2×	15 days	Maybe — slow burn
14×	2 days	Yes (high severity)
100×	7 hours	Yes (critical)

Multi-window approach (catches both fast and slow burn):

Fast window (5 min): burn rate > 14× over 5 min → critical alert.
Slow window (1 h): burn rate > 1× over 1 h → warning.

This avoids both pager fatigue (slow degradation pages too often) and silent budget exhaustion (slow drift not caught).

Per-alert anatomy

Every alert config includes:

- name: payments-p95-latency-burn-rate
  description: Payment endpoint p95 latency violating SLO
  severity: SEV-1
  query: |
    (rate(http_request_duration_seconds{handler="/api/charge",quantile="0.95"} > 0.5) > ...)
  windows:
    - 5m, threshold 14
    - 1h, threshold 1.5
  runbook: https://runbooks.example.com/payments-latency
  team: payments
  page_at: SEV-1

Required fields:

Name + description: human-readable.
Severity: maps to paging policy.
Query: the actual signal.
Windows + thresholds: when to fire.
Runbook URL: link.
Team: routing.

If you can't fill runbook URL, the alert isn't ready.

Runbook structure

Five sections:

# Runbook — Payments p95 latency violating SLO

## Symptoms

- Alert: `payments-p95-latency-burn-rate` firing.
- User-visible: checkout slow; customer reports.
- Dashboard: payments p95 dashboard shows red.

## Verify (1 minute)

- Open dashboard: <link>
- Confirm p95 elevation is real (not metric spike).
- Check if related alerts also firing (DB, payment-provider, dependency).

## Mitigate (5 minutes)

- If a recent deploy: roll back (last good revision: `git log origin/main`).
- If a payment provider issue: flip kill-switch to fallback provider (see `flags-pattern.md`).
- If a DB issue: route to replica (see `distributed-data-pattern.md`).

## Diagnose (15 minutes)

- Trace: <link to traces filtered to the slow endpoint>.
- Recent deploys: <link>.
- Provider status: <link to third-party status page>.
- Anomalies: <link to dashboard>.

## Resolve

- If root cause is in our code: ticket; fix in next PR.
- If root cause is provider: comms to customers; track provider's resolution.
- Post-mortem within 5 business days.

## Last reviewed

YYYY-MM-DD by @owner. Next review: quarterly.

A runbook last reviewed > 6 months ago is suspect.

Alert routing

Severity	Action	Who	Mechanism
SEV-1	Page primary	On-call primary	PagerDuty / Opsgenie / Splunk On-Call
SEV-2	Page primary	On-call primary	Same; possibly different paging policy
SEV-3	Notify	Team channel	Slack / Teams
SEV-4	Ticket	Backlog	Jira / Linear / GitHub Issues

Per-team routing: the alert knows which team owns the affected service.

Alert tuning loop (quarterly)

For every alert in the system, ask:

Question	Action if "no"
Did this fire this quarter?	If never: delete (or document as latent SEV-1 sentinel)
Was every fire actionable?	If < 80%: tune threshold / scope
Did the runbook help?	If no: rewrite or delete the runbook
Did the fire correlate with user impact?	If no: convert to non-paging dashboard
Has the underlying alert query changed since last review?	If yes: re-validate threshold

Track:

Page-worthy ratio per alert (actionable fires / total fires).
MTTA (mean time to acknowledge) per alert.
Pages per shift (target: < 2 per week per responder; ≤ 0 ideal).

Alert fatigue — the cycle

Too many alerts → responders ignore → real alert missed → outage → more alerts added → fatigue worsens.

Symptoms:

Pages auto-acknowledged without reading.
"I'll look at it after my coffee" responses.
New team members ignored on first pages.
Outages where the alert was present but ignored.

Recovery:

Delete more aggressively than you add.
Audit alerts quarterly with the team that gets paged by them.
Move medium-signal alerts to dashboards.

Synthetic checks

For golden paths (login, checkout, primary-feature-X):

Synthetic check from outside the system, every minute (or 5 min).
Multi-region (catches geo-localised failure).
End-to-end (not just "ping returns 200").
Alert on N consecutive failures (avoid single-shot false positives).

Synthetic checks catch what internal metrics miss — network-edge issues, DNS, CDN, third-party-provider-down.

Service health page

Per service:

Currently-firing alerts.
Recent incidents.
SLO burn rates.
Recent deploys.
Owning team + on-call contact.

A single URL per service that answers "is this service okay?" without expertise.

Alert dependencies

Some alerts depend on others. The classic trap:

Service A's alert fires.
Service B (which depends on A) also alerts; redundant.
Service C (which depends on B) also alerts; double-redundant.
One incident → 7 pages.

Mitigations:

Alert grouping: cluster related alerts into one notification.
Dependency-aware suppression: if A is firing, suppress dependents until A resolves.
One service of record: only the leaf service's alert pages; uphill services log but don't page.

Alert blast-radius

When an alert fires, what's the radius?

One service → service team paged.
Multi-service → cross-team incident; escalate to IMOC (see ../security/on-call-rotation-pattern.md).
Whole system → SEV-1; full incident response.

Routing logic accounts for radius:

Single-service alert → team page.
Multi-service correlation → IMOC page.
Auth / billing / audit failure → security team + IMOC.

Pre-alert: dashboards + logs

Not every signal alerts. Three tiers:

Pages: the small set that pages someone.
Dashboards: visible during business hours; reviewed weekly.
Logs / traces: queryable on demand; not surfaced unless investigated.

The split avoids both fatigue (everything pages) and silent drift (nothing surfaces).

Common failure modes

Alert on every error. Pager fatigue. → Symptom alerts; SLO burn rates.
No runbook. Responder guesses. → Mandatory runbook URL on every alert.
Runbook says "see logs". Useless. → Specific queries, links, commands.
Alert fires for stale reasons. Half the team ignores; new joiners don't. → Quarterly tune.
No synthetic checks. Real users feel outage before metrics. → Synthetic for golden paths.
Per-host alerts in cattle-not-pets infra. One pod restart = page. → Aggregate; alert on the abnormal rate, not individual.
No grouping. One incident = 20 pages. → Group; suppress dependents.

Tooling stack (typical)

Concern	Tool
Alert engine	Alertmanager (Prometheus), Grafana Alerting, native cloud (CloudWatch, GCP Monitoring)
Paging	PagerDuty, Opsgenie, Splunk On-Call, VictorOps
Synthetic checks	Datadog Synthetics, Pingdom, Checkly, native cloud
Dashboards	Grafana, Datadog, native cloud
SLO management	Sloth (Prometheus SLO generator), Datadog SLOs, in-house
Runbooks	Confluence, repo-committed markdown, FireHydrant, Jeli
Incident tracking	FireHydrant, Jeli, Incident.io

Alerting + Runbooks Pattern

Alerting + Runbooks Pattern

TL;DR (human)

For agents

What deserves an alert

Alert types

Burn-rate alerting (SLO-based)

Per-alert anatomy

Runbook structure

Alert routing

Alert tuning loop (quarterly)

Alert fatigue — the cycle

Synthetic checks

Service health page

Alert dependencies

Alert blast-radius

Pre-alert: dashboards + logs

Common failure modes

Tooling stack (typical)

See also

On this page