On-Call Rotation Pattern
How to keep a system reliable 24/7 without burning out the team that maintains it.
On-Call Rotation Pattern
How to keep a system reliable 24/7 without burning out the team that maintains it.
TL;DR (human)
On-call is a structured rotation: who responds, in what order, with what tools, for which alerts. A healthy rotation alerts rarely (the system is reliable), pages with high signal (every page is actionable), and rewards the carrier (compensation, time off, growth credit). A burning rotation breaks the team faster than it breaks the system.
For agents
Rotation structure
| Element | Default |
|---|---|
| Primary | One person, on duty 24/7 for the rotation period |
| Secondary | Backup; escalation if primary unreachable in 15 min |
| Period | 1 week (5–7 days); never longer; avoid 24-hour fragments |
| Frequency | Each person on call no more than 1 week in 6 (i.e. ≥ 6-person rotation) |
| Handoff | Live sync at start of week: open incidents, known risks, pending work |
Smaller teams (< 6) need creative coverage: pair primary across the week; rotate days; or hire / outsource until rotation is sustainable.
Alert hygiene
Every alert that fires must answer:
- What is broken? (specific, not "service is down")
- Who is affected? (which users, how many)
- What action does the responder take? (runbook link)
- What is the SLO impact? (burn rate)
- What is the severity? (SEV-1/2/3/4 — see below)
If an alert can't answer all five, it's noise. Either silence it or improve it.
Severity ladder
| Sev | Criterion | Response time |
|---|---|---|
| SEV-1 | Total or major outage; customer-impacting; SLO red | Page primary immediately; respond ≤ 5 min |
| SEV-2 | Significant degradation; some customers impacted | Page primary; respond ≤ 15 min |
| SEV-3 | Issue with workaround; SLO at risk; cost spike | Notify (chat/email); respond next business hour |
| SEV-4 | Annoyance; data inconsistency; non-customer-facing | Ticket; address in next sprint |
Page-worthy = SEV-1/2 only. Anything that pages at SEV-3+ is mis-tuned.
What gets paged
Page-worthy categories:
- Customer-facing outage: API down, login broken, checkout failing.
- SLO burn-rate critical: error budget exhausting fast.
- Security incident: suspected breach, leaked secret, auth bypass.
- Data integrity at risk: corruption, replication divergence, audit-ledger verification failure.
- Cost catastrophe: unbounded resource consumption (DoS, runaway query, infinite retry).
What does not page:
- Single failed test in CI.
- Latency spike for one minute.
- Deploy failure (it should not have deployed at all).
- Non-critical job failure (queue it for next-day).
- Customer support tickets (different channel).
Incident response — the IMOC pattern
When a SEV-1/2 fires:
- Incident Manager On Call (IMOC): takes the page; coordinates. Not necessarily the fixer.
- Technical Lead On Call (TLOC): drives the fix. May be the primary if rotation is small.
- Communications: posts status updates; updates status page; talks to customers (someone else, not the fixer).
- Scribe: documents the timeline as it happens. The post-mortem starts at the first page, not after.
For small teams, one person plays multiple roles. The point: each role is explicit; nobody fixes AND communicates AND documents (that's how mistakes happen).
Runbooks
Per known incident class, a runbook lives at a stable URL. Contents:
- Symptoms: how you know this is happening.
- Verification: commands / dashboards that confirm.
- Immediate mitigation: actions that reduce blast radius before root cause.
- Diagnosis: where to look; what to query.
- Resolution: how to fix.
- Rollback: if the fix is wrong.
- Comms template: what to say to customers / status page.
Runbooks are tested. A runbook that has never been followed is a guess; quarterly drill validates.
Status pages
Customer-facing status page:
- Updated by the IMOC during incidents.
- Severity matches what customers see (not internal sev).
- Updates every 30 min minimum during active incident.
- Post-resolution: brief summary; link to post-mortem when published.
Status page is the contract for transparency. Customers tolerate outages; they do not tolerate silence.
Post-mortems
After every SEV-1/2:
- Written within 5 business days.
- Blameless: focuses on system gaps, not human errors.
- Includes timeline (from first page to resolution).
- Root cause analysis: usually the "5 whys" chain.
- Action items with owners + dates.
- Published internally (team-wide); sometimes externally (transparency).
Anti-patterns:
- Post-mortem that blames an individual.
- "Action items" with no owner / no date.
- Post-mortem that never gets written ("we know what happened").
- Same incident class repeats; action items never landed.
Compensation + recovery
On-call has cost. Reward it:
- Compensation: per-shift stipend OR equivalent time off OR explicit credit toward promotion.
- Post-shift recovery: a day off after a heavy week is not a luxury; it's load balancing.
- Page-night reward: pages at 3 AM compensate further. Adjust the next day off.
- Swap freedom: people swap shifts without bureaucracy (within the rotation).
Teams that under-compensate on-call burn out the senior engineers first. They leave; juniors take their place under-prepared; pages get worse; spiral.
On-call as a learning surface
Done well, on-call accelerates engineer growth:
- Forced exposure to the whole system.
- Real incidents teach incident response.
- Runbook authoring is documentation practice.
- Post-mortem participation is system reasoning.
Pair junior + senior on rotation; rotate roles so juniors take IMOC eventually.
Alert tuning loop
Quarterly review of all alerts:
- Page-worthy ratio: pages that turned out to be actionable / total pages. Target > 80%.
- Mean time to acknowledge (MTTA): how fast pages get accepted.
- Mean time to resolve (MTTR): how fast incidents close.
- Alert that never fires: review; either preventable (good) or no longer relevant (delete).
- Alert that fires often: investigate the underlying instability.
Delete more alerts than you add. The total alert count should be small enough to memorise.
Tools
| Concern | Tool |
|---|---|
| Paging | PagerDuty, Opsgenie, VictorOps, Splunk On-Call |
| Status page | Statuspage, Better Stack, Instatus |
| Incident comms | Slack channel + IMOC plays |
| Runbooks | Confluence, Notion, repo-committed markdown |
| Post-mortems | Confluence, repo-committed markdown, Sentry, FireHydrant, Jeli |
| Schedule | Calendly, Lever, native to paging tool |
Common failure modes
- One-person rotation. Burnout guaranteed; bus factor of 1. → Minimum 4–6; bring in cross-team rotation if needed.
- No runbooks. Incidents resolved by tribal knowledge. → Document; quarterly drill.
- Page-fatigue. Every alert pages; responders ignore; real one missed. → Tune; SEV ladder; delete noisy alerts.
- No post-mortems. Same incident recurs. → Mandatory after SEV-1/2.
- Post-mortem blame. Engineers cover up incidents to avoid scrutiny. → Blameless; system-focused.
- Manager off-rotation. Decisions stall during incidents. → Manager rotates too; or designate ICs.
- Hire-and-toss. New engineers thrown on rotation without training. → Shadow shifts; pair rotation; document onboarding.
- No comp. Senior engineers leave. → Compensate; the alternative is more expensive.
Adoption path
- Day 0: business hours best-effort. Acknowledge: not 24/7 yet.
- Pre-customer: identify paging-worthy alerts; minimum 4 people in rotation.
- Beta launch: SEV ladder; runbooks for top 5 known failure modes.
- GA: 24/7 rotation; full alert hygiene; post-mortem discipline.
- Scale: IMOC role distinct from TLOC; comms role separate.
- Mature: chaos engineering integrated; runbooks tested; alerts tuned quarterly.
Adopting too fast (full 24/7 rotation before product is stable) burns the team. Adopting too slow (no rotation at GA) burns customers.
See also
../quality/observability-pattern.md— SLOs + alerts feed paging.../quality/chaos-engineering-pattern.md— drills test the rotation.audit-ledger-pattern.md— security incidents flow through on-call.secrets-leak-postmortem-playbook.md— specific runbook example.../../phases/06-operate/README.md— on-call sits inside Operate.
Multi-Tenant Isolation Pattern
How to host many customers in one system without one customer leaking, blocking, or paying for another.
Rate Limiting + DDoS Protection Pattern
How to absorb traffic spikes — both benign (popular feature) and malicious (DDoS / abuse) — without breaking for legitimate users.