# Agents Playbook — Full Bundle

Single-file dump of every doc. Documents are separated by lines starting with "==== ".
Each section begins with the canonical URL at https://playbook.agentskit.io.

Generated automatically; treat as the canonical reference snapshot.

==== https://playbook.agentskit.io/docs/getting-started
---
title: Getting started
description: Adopt the playbook in your project in under an hour.
---

The playbook is designed to be **adopted incrementally** — you can pull one pillar at a time, copy the templates that fit, and wire the gate scripts as you go.

## 60-minute adoption path

1. **Copy `CLAUDE.md`** (and `AGENTS.md` if you don't already have one) from [templates](/docs/templates) into your repo root. Customise the non-negotiables to your stack.
2. **Add the `MEMORY.md` index** + a `.agent-memory/` directory. Every lesson learned becomes a one-file memory.
3. **Pick 2-3 gate scripts** from [scripts](/docs/scripts) — start with `check-file-size`, `check-no-any`, `check-named-exports`. Wire them as a `pnpm check:quality-gates` script.
4. **Stand up `docs/adr/`** with ADR-0001 (Philosophy) using the [ADR template](/docs/templates/ADR.template).
5. **Adopt the PR-intent manifest** ([template](/docs/templates/PR-intent.template)) on the next PR.

That's it for day one. Everything else flows from these.
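To make step 3 of the adoption path concrete, here is a minimal sketch of what two of the named gates could look like as pure functions over in-memory source text. It is not the playbook's reference implementation: the `Violation` shape, the `MAX_LINES` budget, and the function names are all assumptions chosen for illustration.

```typescript
// A violation reported by a gate. Hypothetical shape, not taken from the playbook's scripts.
type Violation = { file: string; rule: string; detail: string };

// Assumed budget for illustration; pick the number that fits your codebase.
const MAX_LINES = 300;

// File-size gate: flags any file longer than the budget.
function checkFileSize(file: string, source: string, maxLines: number = MAX_LINES): Violation[] {
  const lineCount = source.split("\n").length;
  if (lineCount <= maxLines) return [];
  return [{ file, rule: "file-size", detail: lineCount + " lines exceeds budget of " + maxLines }];
}

// Named-exports gate: a naive regex scan for `export default`.
// A production gate would more likely parse the file; this can misfire on comments or strings.
function checkNamedExports(file: string, source: string): Violation[] {
  if (!/^\s*export\s+default\b/m.test(source)) return [];
  return [{ file, rule: "named-exports", detail: "default export found" }];
}

// Orchestrator: runs every gate over an in-memory map of path -> source text.
function runGates(files: { [path: string]: string }): Violation[] {
  const violations: Violation[] = [];
  for (const file of Object.keys(files)) {
    const source = files[file];
    violations.push(...checkFileSize(file, source), ...checkNamedExports(file, source));
  }
  return violations;
}
```

A script wired as `pnpm check:quality-gates` would read the files from disk, call something like `runGates`, print the violations, and exit non-zero when the list is non-empty. Keeping the checks pure makes them easy to test without touching the file system.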
## Per-pillar deeper adoption

| Pillar | Most-impactful first move |
|---|---|
| [Architecture](/docs/pillars/architecture) | Modular boundary table in `AGENTS.md` |
| [Security](/docs/pillars/security) | Threat model skeleton + vault stub |
| [UI / UX](/docs/pillars/ui-ux) | Tokens + 8-10 primitives in `packages/ui` |
| [Quality](/docs/pillars/quality) | Gate orchestrator + per-package coverage targets |
| [Governance](/docs/pillars/governance) | PR-intent gate + tombstone convention |
| [AI-collaboration](/docs/pillars/ai-collaboration) | `CLAUDE.md` + MEMORY pattern |

## For agents working in adopting repos

If you're an agent landing in a repo that has adopted this playbook:

1. Read `CLAUDE.md` first.
2. Read `AGENTS.md` to find which package to touch.
3. Pull the relevant doc from this site via raw URL: `https://playbook.agentskit.io/raw/.md`.
4. Follow the per-pillar discipline.
5. Update memory when a lesson lands.

## Not adopting wholesale?

That's fine. The playbook is modular. Take what fits; leave the rest. Each pattern doc explains the failure mode it prevents — if you don't have that failure mode, you don't need that pattern.

==== https://playbook.agentskit.io/docs/glossary
---
title: 'Glossary'
description: 'Short definitions for the terms used across this playbook.'
---

# Glossary

Short definitions for the terms used across this playbook.

| Term | Definition |
|---|---|
| **ADR** | Architecture Decision Record. Numbered, append-only doc capturing one decision. Source of truth for codebase structure. |
| **RFC** | Request for Comment. ADR-with-review-window, used for changes to public contracts. Promotes to ADR on acceptance. |
| **Boundary** | Anywhere bytes enter the process from outside the trust boundary: HTTP, IPC, JSON-RPC, file IO, env, message bus. |
| **Contract** | A schema (typically Zod / Pydantic / Protobuf) that defines the shape crossing a boundary. |
| **Contract package** | The dependency-free package that owns all shared schemas + the error model. |
| **Stable code** | A string error code (`\_\`, all caps) that clients pattern-match on. Append-only. |
| **AppError** | The single base error class. Every other error subclasses it. |
| **Routing table** | The `AGENTS.md` mapping from "I want to change X" to "edit package Y". |
| **PR-intent manifest** | YAML block in PR description: `adds:`, `changes:`, `removes:`, `tests:`, `docs:`. Verified by a gate. |
| **removes-list** | The `removes:` entries in the manifest. Removing another author's exported symbol requires one. |
| **Sub-unit** | One discrete, shippable change. One sub-unit per session. |
| **Phased PR** | A long initiative shipped as a chain of phase PRs, each merged with `--merge --admin`. |
| **Verify-first** | Confirm state (issue open, branch fresh, file path real) before acting. |
| **Tombstone** | Status block prepended to a retired doc indicating it is no longer active. Body kept for trail. |
| **Completeness contract** | Per-screen rule that no `TODO`/disabled tab/empty body ships. |
| **Hermetic test** | In-process test that reproduces a bug without external services. Preferred over E2E for repro/lock. |
| **Quality gates** | Fast structural checks (file size, no `any`, named exports, intl, tokens). Run pre-commit + CI. |
| **Sanity** | Cross-cutting periodic audit. Generates a report; CI fails on regressions. |
| **Shrink-only baseline** | Gate config: existing offenders are baselined, new violations fail. Baseline can only shrink. |
| **Break-glass** | Time-boxed admin role elevation with signed audit trail. |
| **Consent (in security)** | Scoped, time-boxed user approval for a specific action. Distinct from role elevation. |
| **Egress allowlist** | Default-deny outbound network policy; only allowlisted destinations resolve. |
| **Sealer** | The key that encrypts secrets at rest. Rotatable. |
| **Audit ledger** | Append-only signed log of privileged operations. Verifiable via batch signature / Merkle anchor. |
| **DSAR** | Data Subject Access Request. GDPR-style export / delete request. |
| **Legal hold** | Flag that suspends retention for a subject under investigation. |
| **Whitelabel / OEM** | Per-tenant brand kit + plan presets that swap product name, logo, palette at build / runtime. |
| **MEMORY pattern** | One-fact-per-file persistent agent memory with an index file (`MEMORY.md`). |
| **Sub-agent** | Scoped specialist agent delegated for a fan-out task. |
| **Slash command** | Palette-invoked workflow body. |
| **Goal mode** | Stop hook with a condition; agent works until the condition holds. |

==== https://playbook.agentskit.io/docs/index
---
title: Welcome
description: The gold-standard playbook for shipping production software with AI coding agents.
---

The **Agents Playbook** captures the rules, guardrails, prompts, gates, and review patterns that consistently produce trustworthy, shippable code from AI coding agents like Claude, Cursor, and Copilot — distilled from a year of agent-driven development on a real production codebase.

## How it's organized

A **matrix of 6 pillars × 6 SDLC phases**. See the [matrix](/docs/matrix) for the cross-reference.

```
pillars/
  architecture/        # ADR, RFC, modular monorepo, contracts, errors, ...
  security/            # RBAC, vault, audit, compliance, AI/LLM safety, ...
  ui-ux/               # tokens, primitives, intl, a11y, whitelabel, ...
  quality/             # tests, gates, sanity, observability, FinOps, ...
  governance/          # PR intent, merge rules, tombstones, ...
  ai-collaboration/    # CLAUDE.md, MEMORY, sub-agents, slash commands, ...
phases/
  01-discover/
  02-design/
  03-build/
  04-test/
  05-ship/
  06-operate/
templates/             # ADR, RFC, PR-intent, CLAUDE.md, AGENTS.md, MEMORY.md
prompts/               # system, sub-agent, slash-command bodies
scripts/               # gate reference implementations (Node, zero deps)
```

## Dual mode

Every doc has two layers:

- **TL;DR (human)** — one paragraph for the linear reader.
- **For agents** — structured sections, fixed shape, optimised for RAG retrieval and system-prompt injection.

## Start here

| Goal | Read |
|---|---|
| Adopt in a new project | [Getting started](/docs/getting-started) |
| Set non-negotiables for agents | [CLAUDE.md template](/docs/templates/CLAUDE.md.template) |
| Design a package boundary | [Architecture · Universal](/docs/pillars/architecture/universal) |
| Add an ADR or RFC | [ADR template](/docs/templates/ADR.template) · [RFC template](/docs/templates/RFC.template) |
| Wire quality gates | [Quality gates](/docs/pillars/quality/quality-gates-pattern) |
| Train an agent on lessons | [MEMORY template](/docs/templates/MEMORY.md.template) |
| Multi-agent merge | [Governance](/docs/pillars/governance) |

## The eight non-negotiables

The irreducible kernel. If an agent breaks one of these, fail the PR.

1. **Typed boundaries.** Every external input is parsed by a runtime schema. No `any`.
2. **Named exports only.** Predictable refactors, predictable agent edits.
3. **Typed error hierarchy with stable codes.** `AppError` subclasses with `_` constants.
4. **Centralized logger.** `createLogger(tag)`; never `console.log` in shipped code.
5. **ADR before architecture change. RFC before breaking a public contract.**
6. **Ship complete or don't ship.** No `TODO`/`FIXME`/`not implemented` in shipped surfaces.
7. **Merges sum work, never subtract.** PR intent manifest; `removes:` justified.
8. **Tokens, intl, primitives.** No raw values in user-facing surfaces.

Each is fully spec'd across the pillars and enforced by gate scripts.
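Non-negotiable 3 is easier to picture with code. The following is a minimal sketch of a typed error hierarchy with stable codes, under stated assumptions: the `ERROR_CODES` registry shape, the `AuthRequiredError` subclass, the `toWire` helper, and the `INTERNAL_UNEXPECTED` fallback code are hypothetical names for illustration. The three listed codes are ones this playbook mentions elsewhere.

```typescript
// Stable codes: append-only string constants that clients pattern-match on.
// The registry shape is an assumption; the playbook only requires a central codes file.
const ERROR_CODES = {
  AUTH_REQUIRED: "AUTH_REQUIRED",
  AUTH_FORBIDDEN: "AUTH_FORBIDDEN",
  SECURITY_EGRESS_DENIED: "SECURITY_EGRESS_DENIED",
} as const;

type ErrorCode = (typeof ERROR_CODES)[keyof typeof ERROR_CODES];

// Single base class; every other error subclasses it.
class AppError extends Error {
  readonly code: ErrorCode;
  constructor(code: ErrorCode, message: string) {
    super(message);
    this.code = code;
    this.name = "AppError";
  }
}

// Hypothetical subclass that pins one code to one class.
class AuthRequiredError extends AppError {
  constructor() {
    super(ERROR_CODES.AUTH_REQUIRED, "Authentication is required.");
    this.name = "AuthRequiredError";
  }
}

// Wire shape: code and message only, so stack traces and internal IDs stay server-side.
type WireError = { code: string; message: string };

function toWire(err: unknown): WireError {
  if (err instanceof AppError) {
    return { code: err.code, message: err.message };
  }
  // Unknown failures collapse to a generic code (hypothetical name) with a fixed message.
  return { code: "INTERNAL_UNEXPECTED", message: "Unexpected error." };
}
```

Because clients match on `code` rather than `message`, copy and translations can change freely without breaking callers or tests.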
## For agents reading this

- The full bundle in one file: [`/llms-full.txt`](/llms-full.txt).
- Site map: [`/llms.txt`](/llms.txt).
- Raw markdown for any doc: replace `/docs/` with `/raw/.md`.
- ZIP of all docs: [`/playbook-bundle.zip`](/playbook-bundle.zip).

==== https://playbook.agentskit.io/docs/matrix
---
title: 'The matrix — pillars × phases'
description: 'This is the master content map. Each cell points to the practices that apply when **pillar** meets **SDLC phase**.'
---

# The matrix — pillars × phases

This is the master content map. Each cell points to the practices that apply when **pillar** meets **SDLC phase**.

Cells marked `(stub)` are scaffold-only; content lands in subsequent sessions. Cells marked `✓` are shipped.

| | 01 Discover | 02 Design | 03 Build | 04 Test | 05 Ship | 06 Operate |
|---|---|---|---|---|---|---|
| **Architecture** | Define mental map; pick 5–7 logical groups | ADR before structure change; RFC before breaking contract | Modular boundaries; named exports; Zod at every edge; sized files | Contract tests; cross-package import lints | Versioned releases; peer-dep compat matrix | Track tech debt against ADRs |
| **Security** | Threat model; data classification | Vault scope; RBAC roles; consent vs elevation | Sealed secrets; egress allowlist; signed audit ledger | Security review per PR; pen-test gates | Signed artifacts; key rotation runbook | Incident response; break-glass audit |
| **UI-UX** | Audience taxonomy; surface inventory | Design tokens; primitives catalog; motion budget | No raw `\`/`\`; intl on every string; `useT()` everywhere | Visual regression; a11y screen-reader pass | Brand-kit/whitelabel build matrix | A/B + delight loops; empty-state honesty |
| **Quality** | Define "done" per surface (completeness contract) | Test pyramid; coverage targets per package | File-size budgets; lint rules; hermetic tests | `check:all`, `sanity`, structural gates, e2e | Pre-push hooks; release gates | Bug-hunt cadence; mutation testing |
| **Governance** | Decision log culture; ADR/RFC processes | PR intent manifest schema; merge rules | Removes-list discipline; concurrent-agent awareness | Multi-agent review pipeline | Changesets; semver discipline | Postmortems; tombstone retired plans |
| **AI-collaboration** | CLAUDE.md, AGENTS.md, MEMORY.md bootstrap | Slash commands; sub-agent recipes; system prompts | Goal-mode loop; one-sub-unit-per-session rule | Verify-first close; duplication-claims-API-grounded | Phased PR + admin merge | Persistent memory; lessons graph |

## Reading order

If you adopt left-to-right (by phase), agents can ramp incrementally. If you adopt top-to-bottom (by pillar), you can roll out one concern across the whole SDLC.

| Adoption mode | Start at | Then |
|---|---|---|
| Greenfield project | `phases/01-discover/` | `templates/CLAUDE.md.template.md` → `pillars/architecture/universal.md` |
| Brownfield retrofit | `pillars/quality/README.md` (gates first) | `pillars/governance/README.md` (PR intent + merge rules) |
| Just need agent rules | `templates/CLAUDE.md.template.md` | `pillars/ai-collaboration/README.md` |
| Just need design system | `pillars/ui-ux/README.md` | `templates/` design-tokens recipe |

## Status legend

- ✓ Shipped (read it)
- ◐ Scaffolded with scope; content partial
- (stub) README placeholder only; no body content yet

## Current status (v0)

| | Universal | TS-concrete |
|---|---|---|
| architecture | ✓ | ✓ |
| security | ✓ | ✓ |
| ui-ux | ✓ | ✓ |
| quality | ✓ | ✓ |
| governance | ✓ | ✓ |
| ai-collaboration | ✓ | ✓ |

| Phase | Status |
|---|---|
| 01 discover | ✓ |
| 02 design | ✓ |
| 03 build | ✓ |
| 04 test | ✓ |
| 05 ship | ✓ |
| 06 operate | ✓ |

| Templates | Status |
|---|---|
| ADR / RFC | ✓ |
| PR intent | ✓ |
| CLAUDE.md / AGENTS.md / MEMORY.md | ✓ |

| Prompts | Status |
|---|---|
| System (architect, implementer, reviewer, security) | ✓ |
| Sub-agent recipes (explore, plan, code-explorer, code-reviewer) | ✓ |
| Slash commands (goal, loop, review, clear, sanity, ship) | ✓ |

| Other | Status |
|---|---|
| Scripts (gate reference impls) | ✓ all 12 gates + orchestrator |
| Phases (deep content) | ✓ |

==== https://playbook.agentskit.io/docs/phases/01-discover
---
title: 'Phase 01 — Discover'
description: 'Define what you are building, who consumes it, and what success looks like — before agents touch the codebase.'
---

# Phase 01 — Discover

Define what you are building, who consumes it, and what success looks like — before agents touch the codebase.

## TL;DR (human)

The discover phase produces the artefacts agents need to be productive from PR #1: a product brief, a surface inventory, a draft threat model, a routing table, and a CLAUDE.md. Skip this phase and agents reinvent rules each session.

## For agents

### Outputs (must exist before moving to Design)

- [ ] **Product brief** — `docs/product/brief.md`. One paragraph: audience, problem, value, success metric.
- [ ] **Surface inventory** — `docs/product/surfaces.md`. Every screen / API / CLI / integration the product exposes (or will at MVP). Status column: planned / in-flight / shipped.
- [ ] **Threat model (draft)** — `docs/security/threat-model.md`. Assets, actors, attack surface. Empty threat-mitigations table is fine on day one; the structure is the asset.
- [ ] **Routing table skeleton** — `AGENTS.md`. Even if the packages do not exist yet, list the planned boundaries with "(planned)" markers.
- [ ] **Bootstrap doc** — `CLAUDE.md`. The non-negotiables you want agents to honour from PR #1.
- [ ] **Decision log** — `docs/adr/` directory exists, with ADR-0001 (Philosophy) drafted.

### Per pillar — Discover-phase checklist

**Architecture**

- [ ] Sketch 4–7 logical groups; do not design 30 packages on day one.
- [ ] Decide the contract package (`core` / `contracts`) and its hard size budget.
- [ ] Decide your runtime schema (Zod / Pydantic / Protobuf).

**Security**

- [ ] Classify the data the system touches (PII / sensitive / public).
- [ ] Pick the tenancy model (single-tenant / multi-tenant / multi-org).
- [ ] Pick the vault provider (in-process / Vault / cloud KMS).

**UI-UX**

- [ ] Identify the audience (technical / non-technical / mixed).
- [ ] Inventory surfaces.
- [ ] Pick the design language reference (Apple HIG / Material / custom).
- [ ] Pick the primitive library substrate (Radix / React Aria / Headless UI).

**Quality**

- [ ] Define "done" per surface (the completeness contract, kept simple early).
- [ ] Pick the test stack.
- [ ] Decide per-package coverage targets.

**Governance**

- [ ] Decide who accepts ADRs / RFCs (humans, named).
- [ ] Define review windows (e.g. 5 days minor, 10 days breaking).
- [ ] Decide branching model (trunk-based + short-lived feature branches recommended).

**AI-collaboration**

- [ ] Write `CLAUDE.md` (template in [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md)).
- [ ] Write `AGENTS.md` skeleton.
- [ ] Decide agent toolchain (Claude Code / Cursor / Aider / your CLI).
- [ ] Bootstrap memory directory + index.

### Common failure modes

- **No surface inventory.** Agents invent surfaces; product surface drifts. → Inventory first.
- **No `CLAUDE.md`.** Agents reinvent rules every session. → Stand it up before the first feature PR.
- **30-package monorepo on day one.** Boundaries that haven't earned themselves. → Start with 4–7 logical groups; split when cohesion forces it.
- **Threat model deferred "until later".** Later never comes. → Empty structure on day one; fill iteratively.

### Exit criteria

You can leave Discover when:

1. A new agent can read `AGENTS.md` and know which package owns what (even if the packages are scaffolds).
2. A new agent can read `CLAUDE.md` and know what is non-negotiable.
3. The decision-log directory exists with at least ADR-0001 accepted.
4. The threat-model doc exists with assets + actors + surface enumerated (mitigations can be empty).
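Part of the Discover exit check is mechanical and can be sketched as a small function over the repo's tracked paths. This is an illustration only: the function name and the ADR file-name pattern are assumptions, and the check covers existence, not content. Whether ADR-0001 is actually accepted, or whether the threat model enumerates assets and actors, still needs a human or a richer gate.

```typescript
// Artefact paths listed under "Outputs" for the Discover phase.
const DISCOVER_OUTPUTS: string[] = [
  "docs/product/brief.md",
  "docs/product/surfaces.md",
  "docs/security/threat-model.md",
  "AGENTS.md",
  "CLAUDE.md",
];

// The decision log is a directory that must contain an ADR-0001 file.
// The file-name pattern is an assumption; adapt it to your ADR naming scheme.
const ADR_0001_PATTERN = /^docs\/adr\/.*0001.*\.md$/i;

// Returns the outputs that are still missing, given the repo's tracked paths
// (for example, the output of `git ls-files` split into lines).
function missingDiscoverOutputs(repoPaths: string[]): string[] {
  const missing: string[] = [];
  for (const required of DISCOVER_OUTPUTS) {
    if (repoPaths.indexOf(required) === -1) missing.push(required);
  }
  if (!repoPaths.some((p) => ADR_0001_PATTERN.test(p))) {
    missing.push("docs/adr/ (ADR-0001)");
  }
  return missing;
}
```

An empty result means the files are in place; it does not by itself mean the exit criteria hold.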
### See also

- [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md)
- [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md)
- [`../../pillars/architecture/universal.md`](../../pillars/architecture/universal.md)
- [`../../pillars/security/threat-model-template.md`](../../pillars/security/threat-model-template.md)

==== https://playbook.agentskit.io/docs/phases/02-design
---
title: 'Phase 02 — Design'
description: 'Turn the discover brief into ADRs, RFCs, and a contract package skeleton the build phase can compose against.'
---

# Phase 02 — Design

Turn the discover brief into ADRs, RFCs, and a contract package skeleton the build phase can compose against.

## TL;DR (human)

Design phase makes the implicit explicit. Every recurring decision becomes an ADR; every breaking-contract change becomes an RFC. The first set of schemas + error codes lands in the contract package. Tokens + primitives + locale skeletons land before any screen is built. Skip this phase and Build phase agents will design ad-hoc — which means inconsistently.

## For agents

### Outputs

- [ ] **ADR-0001 (Philosophy)** — what this codebase optimises for; what it does not.
- [ ] **ADR-0002 (Composition rules)** — what you depend on; what you do not duplicate.
- [ ] **ADR-0003 (Contract registry)** — how methods / schemas register; how the dispatcher enforces them.
- [ ] **First schemas + error codes** — landed in the contract package; tests assert parse + reject.
- [ ] **Design tokens** — primitive + semantic layers; default brand kit + test brand kit.
- [ ] **Locale skeleton** — at minimum `en.json`; the `useT()` hook works.
- [ ] **Primitives catalog (initial)** — 8–12 primitives sufficient to compose the first 3 screens.
- [ ] **Quality gates wired** — at least file-size, no-any, named-exports, raw-error, pr-intent scripts in CI.

### Per pillar

**Architecture**

- [ ] Stand up the contract package. Lock its size budget (CI gate).
- [ ] Stand up the runtime package skeleton (depends on contract).
- [ ] Stand up the storage package skeleton (depends on contract).
- [ ] Sub-path package layout decided (RFC if applicable).
- [ ] `AGENTS.md` routing table reflects real packages now.

**Security**

- [ ] RFC the auth model (sessions, principal shape, tenancy resolution).
- [ ] Decide vault provider; integrate the shim; first secret accessed via reference.
- [ ] Audit ledger skeleton: store + signer + verify utility.
- [ ] Egress allowlist shim (`safeFetch`) wired; default deny.
- [ ] Threat model populated with the first 5–10 threats × mitigations.

**UI-UX**

- [ ] Design tokens land in CSS variables + Tailwind config (or equivalent).
- [ ] Primitives catalog ships: Button, Input, Select, Dialog, Table, EmptyState, Skeleton, Toast.
- [ ] Locale infrastructure (`useT()` + `en.json` + parity gate).
- [ ] Whitelabel runtime stub (default brand + test brand).
- [ ] A11y baseline: axe wired in CI; baseline of existing violations captured.

**Quality**

- [ ] Test stack picked and wired (Vitest / Playwright, or equivalent).
- [ ] Per-package coverage thresholds in CI config.
- [ ] Quality-gates orchestrator script (`pnpm check:quality-gates`) live.
- [ ] Pre-push hook wired (structural gates + typecheck + build; no full tests).
- [ ] Sanity audit script + report path decided.

**Governance**

- [ ] PR-intent manifest format + gate.
- [ ] ADR + RFC index files; check-adr / check-rfc gates.
- [ ] Tombstone convention decided (the emoji block + back-ref sweep).
- [ ] Phased-PR convention documented in `CONTRIBUTING.md`.

**AI-collaboration**

- [ ] CLAUDE.md non-negotiables locked.
- [ ] Sub-agent recipes selected from [`../../prompts/`](../../prompts/) — adapted to your toolchain.
- [ ] Slash commands (`/goal`, `/review`, `/sanity`) wired.
- [ ] Memory directory + `MEMORY.md` index established.

### Common failure modes

- **No initial ADRs.** Conventions are "in the chat"; agents reinvent them.
  → ADR-0001 / -0002 / -0003 minimum.
- **Tokens but no whitelabel test.** Tokens "work" in default brand only. → Ship a test brand kit; render against it in CI.
- **Quality gates land late.** "We'll add them after the first feature." Existing offenders accumulate; gate is too painful to turn on. → Land gates BEFORE the first feature PR.
- **Primitives catalog rolled into screen work.** Each screen invents a Button differently. → Catalog first; screens second.

### Exit criteria

You can leave Design when:

1. An implementer agent can pick up a feature ticket and compose against existing schemas, primitives, and gates — without inventing new ones.
2. The build phase has zero "where does X live?" questions left.
3. The first three feature ADRs land cleanly (proves the ADR pipeline works).
4. CI runs the quality gates and exits 0 on the empty / scaffold codebase.

### See also

- [`../../pillars/architecture/adr-pattern.md`](../../pillars/architecture/adr-pattern.md), [`../../pillars/architecture/rfc-pattern.md`](../../pillars/architecture/rfc-pattern.md)
- [`../../pillars/architecture/contracts-zod-pattern.md`](../../pillars/architecture/contracts-zod-pattern.md)
- [`../../templates/ADR.template.md`](../../templates/ADR.template.md), [`../../templates/RFC.template.md`](../../templates/RFC.template.md)
- [`../../pillars/ui-ux/design-tokens-pattern.md`](../../pillars/ui-ux/design-tokens-pattern.md)
- [`../../scripts/README.md`](../../scripts/README.md)

==== https://playbook.agentskit.io/docs/phases/03-build
---
title: 'Phase 03 — Build'
description: 'Where most of the agent-augmented work happens. Discipline shifts from ''writing code'' to ''stating intent, then verifying output against it''.'
---

# Phase 03 — Build

Where most of the agent-augmented work happens. Discipline shifts from "writing code" to "stating intent, then verifying output against it".

## TL;DR (human)

Build is a loop, not a sprint. One sub-unit per session.
PR intent declared up front; gates green before merge; tests + docs in the same PR. The conventions land in Design; Build phase enforces them.

## For agents

### The build loop

1. **Pick a sub-unit.** One discrete, shippable change. Defined up front.
2. **State intent.** PR-intent manifest in the PR description (or `pr-intent.yaml`). `adds:`, `changes:`, `removes:`, `tests:`, `docs:`, `gates:`.
3. **Verify state.** `git fetch`; recheck issue state; look for in-flight peer PRs on the same paths.
4. **Plan.** If the change is non-trivial, delegate to a `subagent-plan` (see [`../../prompts/subagent-plan.md`](../../prompts/subagent-plan.md)).
5. **Implement.** Tests in the same PR. Hermetic over E2E. Tests assert on codes.
6. **Self-review.** Run `pnpm check:quality-gates`. Read your own diff as a reviewer.
7. **Open PR.** Manifest in description. Link issue, ADR/RFC, related PRs.
8. **Address review per comment.** No wholesale rewrites in response to one comment.
9. **Merge clean.** Phased PRs: `gh pr merge --merge --admin` after gates green. Delete branch.

### Per pillar — Build-phase discipline

**Architecture**

- [ ] If a change crosses an unclear boundary: STOP. Draft an ADR; resume after acceptance.
- [ ] New schemas land in the contract package, never in a feature package.
- [ ] New error codes append to the central codes file.
- [ ] Named exports only. No `any` at boundaries.

**Security**

- [ ] Every new method declares `requireAuth: true` (or explicit `false` reviewed in PR).
- [ ] Tenancy comes from session, never from body.
- [ ] Privileged ops audit-log before execute.
- [ ] Outbound fetch goes through `safeFetch`.
- [ ] Secrets are vault refs.
- [ ] No stack traces or internal IDs in wire-serialized errors.

**UI-UX**

- [ ] Tokens for all visual values (no hex / rgb / arbitrary class).
- [ ] Shared primitives only (no native `\` / `\` / etc.).
- [ ] `useT()` for every user-visible string.
- [ ] `\` for content loading; spinners only for inline actions.
- [ ] `\` with cause-typed variants.
- [ ] Keyboard reachable; focus visible; aria labels.

**Quality**

- [ ] Tests in the same PR (no "tests next PR").
- [ ] Tests assert on codes / `byRole`, not on rendered text.
- [ ] File-size budgets honoured; extract on overflow.
- [ ] `pnpm check:quality-gates` green before merge.

**Governance**

- [ ] PR intent declared; `removes:` justified.
- [ ] `merge-override:` annotation if conflict resolution dropped a side.
- [ ] One sub-unit per PR.
- [ ] Verify-first close: `gh issue view \` before "fixing".

**AI-collaboration**

- [ ] One sub-unit per session.
- [ ] Honest reporting: failures quoted, skipped steps stated.
- [ ] Persistent memory updated when a non-obvious lesson lands.
- [ ] Sub-agents for fan-out (search, plan, review).
- [ ] Concurrent-agent awareness: peer PRs read before starting.

### Common failure modes

- **"While I'm here" scope creep.** PR balloons; review gets lost. → File the side issue; do not pursue.
- **Tests deferred.** "Tests in a follow-up PR" — follow-up never lands. → Same PR or no PR.
- **`--no-verify` push to skip the hook.** Bypasses the safety net. → Justify in PR; investigate the hook's slowness.
- **Renaming as delete + add.** Looks like a remove in the diff; manifest is misleading. → State explicitly: "renamed X → Y" in `changes:`.
- **PR description that does not match the diff.** Reviewer cannot verify. → Manifest IS the contract.

### Sub-unit examples

Good sub-units (one PR each):

- "Add `users.invite` method with email validation + audit log + test."
- "Refactor `flow-editor` to extract `parts/properties-panel.tsx` (file-size budget)."
- "Add `consent.grant` UI surface; wires existing backend method."

Bad sub-units (split these):

- "Add user invitations + workspace switching" (two features).
- "Refactor X and fix Y" (two intents).
- "Phase 1 + Phase 2 of \" (chain into separate PRs).

### Exit criteria

Build is a loop, not a phase that exits.
The codebase is "in Build" for most of its life. Each cycle through the loop completes when:

1. PR merged.
2. Gates green on main.
3. Issue closed; sub-unit tracker updated.
4. Memory updated if a lesson landed.

### See also

- [`../../pillars/governance/pr-intent-pattern.md`](../../pillars/governance/pr-intent-pattern.md)
- [`../../pillars/ai-collaboration/universal.md`](../../pillars/ai-collaboration/universal.md)
- [`../../prompts/system-implementer.md`](../../prompts/system-implementer.md)
- [`../../templates/PR-intent.template.md`](../../templates/PR-intent.template.md)

==== https://playbook.agentskit.io/docs/phases/04-test
---
title: 'Phase 04 — Test'
description: 'How ''tests pass'' stops being a feeling and starts being a contract.'
---

# Phase 04 — Test

How "tests pass" stops being a feeling and starts being a contract.

## TL;DR (human)

Five tiers of tests, ordered by cost. Spend most budget on tiers 1–2 (schema parse + unit). Reserve E2E for golden paths. Tests assert on codes / structure, not rendered text. Hermetic over E2E for bug repro. Verify-first before "fixing" a flaky test.

## For agents

### Test layers (target distribution)

| Tier | Type | Runtime | % of suite |
|---|---|---|---|
| 1 | Schema parse / contract | <1 ms | ~30% |
| 2 | Unit (pure functions, single class) | <10 ms | ~40% |
| 3 | Integration (handler + store + adapter, in-process) | <500 ms | ~25% |
| 4 | Visual regression / a11y | seconds | ~4% |
| 5 | E2E (real app, real services) | minutes | ~1% |

Inverted pyramids (mostly E2E) produce flaky, slow suites with poor signal.

### Per pillar — Test-phase discipline

**Architecture**

- [ ] Every contract has a parse test (happy + reject).
- [ ] Every error code is asserted somewhere in the suite (a separate gate scans for `code: "\"` assertions).
- [ ] Handler return values are parsed by the result schema (catches handler bugs at boundary).

**Security**

- [ ] Auth tests: missing `principalId` → `AUTH_REQUIRED`.
- [ ] Tenancy tests: caller cannot access other-workspace data.
- [ ] Egress tests: blocked host produces `SECURITY_EGRESS_DENIED`.
- [ ] Audit tests: privileged action writes intent before execute.
- [ ] Secrets tests: logger redaction works on known patterns.

**UI-UX**

- [ ] A11y: axe scan on every changed screen (`@axe-core` in CI).
- [ ] Visual regression: per-primitive snapshot in default + test brand kit.
- [ ] Intl parity: every key exists in every shipped locale.
- [ ] Empty-state coverage: every list surface has at least one empty-state test.

**Quality**

- [ ] Per-package coverage hits its threshold.
- [ ] Mutation testing on stable utility modules.
- [ ] Property-based tests for parsers / serializers / math.
- [ ] No `it("works")` / `it("test 1")` — names read like sentences.

**Governance**

- [ ] PR-intent gate passes (manifest matches diff).
- [ ] ADR / RFC integrity gate passes.

**AI-collaboration**

- [ ] Verify-first before "fixing" a red signal.
- [ ] Honest test reporting (failed tests quoted, skipped tests stated).

### Triage protocol — when a test fails

1. **Reproduce locally.** Confirm the failure on your machine.
2. **Stash + verify red on `origin/main`.** If main is red, the failure is pre-existing — file an issue; do not "fix" it in your branch.
3. **Determine tier.** Could a lower-tier test pin this? If yes, add the lower-tier test, fix the bug, both turn green.
4. **Fix.** The fix is the smallest diff that flips the test from red to green without changing other behavior.
5. **Add a regression test if missing.** If the failure was a real bug not previously tested.

### Hermetic over E2E for bug repro

When a bug is reported:

1. Try to reproduce in a unit test against the suspect module. Pin it.
2. If that's not enough, integration test wiring stores + handlers in-process.
3. E2E only if cross-process / browser-only behavior.

A 2-second unit test that fails reliably beats a 60-second E2E that flakes.
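As an illustration of the hermetic approach, the sketch below pins the auth and tenancy checks from the Security checklist entirely in-process, with an in-memory store instead of a database. Everything named here is hypothetical: `CodedError` stands in for an `AppError` subclass, `getDoc` is an invented handler, and mapping cross-workspace reads to `AUTH_FORBIDDEN` is an assumption that your own codes file would decide.

```typescript
// Minimal coded error; stands in for an AppError subclass.
class CodedError extends Error {
  readonly code: string;
  constructor(code: string) {
    super(code);
    this.code = code;
  }
}

type Session = { principalId?: string; workspaceId?: string };
type Doc = { id: string; workspaceId: string; title: string };
type DocStore = { get(id: string): Doc | undefined };

// In-memory store: the whole test runs in-process, with no database or network.
function createMemoryStore(docs: Doc[]): DocStore {
  return {
    get(id: string): Doc | undefined {
      for (const doc of docs) {
        if (doc.id === id) return doc;
      }
      return undefined;
    },
  };
}

// Hypothetical handler. Tenancy comes from the session, never from the request body.
function getDoc(session: Session, input: { id: string }, store: DocStore): Doc {
  if (!session.principalId) throw new CodedError("AUTH_REQUIRED");
  const doc = store.get(input.id);
  // Assumption: a missing doc and a cross-workspace doc look identical to the caller.
  if (!doc || doc.workspaceId !== session.workspaceId) throw new CodedError("AUTH_FORBIDDEN");
  return doc;
}

// Test helper: run a function and return the code it failed with, or "OK".
function codeOf(run: () => unknown): string {
  try {
    run();
    return "OK";
  } catch (err) {
    const code = (err as { code?: unknown }).code;
    return typeof code === "string" ? code : "UNKNOWN";
  }
}
```

In a Vitest-style suite these would become three short cases, for example `expect(codeOf(() => getDoc({}, { id: "d1" }, store))).toBe("AUTH_REQUIRED")`, each running in well under a second.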
### Tests assert on codes, not messages

```ts
// ✗ wrong — breaks on intl / copy change
expect(err.message).toContain("not authorized");

// ✓ right
expect(err.code).toBe("AUTH_FORBIDDEN");
```

```tsx
// ✗ wrong — breaks on intl / copy change
expect(screen.getByText("Save")).toBeInTheDocument();

// ✓ right
expect(screen.getByRole("button", { name: /save/i })).toBeInTheDocument();
```

### Determinism

- No file system writes outside per-test temp dirs.
- No network calls (mock the boundary).
- No clock drift (inject the clock).
- No global module state.

A test that passes in isolation and fails in parallel has hidden global state. Fix the test, not the order.

### Common failure modes

- **Inverted pyramid.** Mostly E2E. Slow + flaky. → Push to lower tiers.
- **Flake "fixed" by `setTimeout`.** Hidden flake. → Find the deterministic signal; `expect.poll()` / `waitFor()`.
- **Coverage 95% but error codes never asserted.** → Separate gate scans for asserted codes.
- **Tests share fixtures via mutation.** Order-dependent. → Fresh fixtures per test.
- **Mock at every layer.** End up testing the mocks. → Mock at the trust boundary.

### Exit criteria

Test is continuous, like Build. Each cycle exits when:

1. New behavior has its test in the same PR.
2. Coverage thresholds hold.
3. Suite runs deterministically in CI.

Pre-release adds: full mutation pass, full a11y pass, cold-prod walk of the demo script.

### See also

- [`../../pillars/quality/test-pyramid.md`](../../pillars/quality/test-pyramid.md)
- [`../../pillars/quality/mutation-testing-pattern.md`](../../pillars/quality/mutation-testing-pattern.md)
- [`../../pillars/architecture/error-hierarchy.md`](../../pillars/architecture/error-hierarchy.md)
- [`../../pillars/ui-ux/a11y-checklist.md`](../../pillars/ui-ux/a11y-checklist.md)

==== https://playbook.agentskit.io/docs/phases/05-ship
---
title: 'Phase 05 — Ship'
description: 'How to turn a green main into a release without surprising consumers.'
--- # Phase 05 — Ship How to turn a green main into a release without surprising consumers. ## TL;DR (human) Tests green is not enough. A release-gate checklist runs structural gates + sanity + cold prod-build walk + changesets + security review. Versions bump per semver. Artifacts get signed. Rollback plan is written before, not after. ## For agents ### Pre-release outputs - [ ] **`pnpm check:all`** — full pre-release sweep green. - [ ] **Sanity report CLEAN** — no metric regressions vs last release baseline. - [ ] **Release-blockers empty** — no open issue tagged `release-blocker`. - [ ] **Changesets generated** — every consumer-visible PR has one; aggregated into the release notes. - [ ] **Cold prod-build walk** — operator-driven; literal demo script; recorded. - [ ] **Security sweep** — pending advisories triaged; threat model reviewed for new surfaces. - [ ] **Signed artifacts** — binaries / installers signed; SHA / GPG attested. - [ ] **Rollback plan** — how to revert; who pushes; on-call assigned. ### Per pillar — Ship-phase discipline **Architecture** - [ ] Peer-dep compat matrix updated. - [ ] Public API diff produced; breaking changes called out. - [ ] ADRs for this release tagged with the release version. **Security** - [ ] Dependency vulnerability triage (CVE list). - [ ] Threat model walked: new surfaces, new mitigations, new residuals. - [ ] Key rotation review (any keys past their rotation date?). - [ ] Audit ledger verification job green. **UI-UX** - [ ] Per-locale parity verified. - [ ] Brand-kit matrix tested (default + test brand both render cleanly). - [ ] A11y full sweep on changed screens. - [ ] Demo walk-through script executed on cold prod build. **Quality** - [ ] `check:all` green. - [ ] Mutation pass scheduled (or last result acceptable). - [ ] Coverage at threshold per package. - [ ] No flaky-tests baseline regressions. **Governance** - [ ] Changesets present per consumer-visible PR. - [ ] Tombstones applied to retired plans. 
- [ ] Release notes drafted (user-facing + internal). **AI-collaboration** - [ ] Memory groomed: facts that are now release-stable promoted to `CLAUDE.md` / `AGENTS.md`. - [ ] Sub-agent recipe updates rolled out (if any). ### Cold prod-build walk (non-skippable) CI green is not "demo reachable". A green build can hide: - Route-gating bugs that prod-only conditions trigger. - Bundler-mode differences (dev vs prod). - SSR vs CSR boundary mistakes. - Environment-variable misses. Therefore: 1. Build a production artefact from `main` on a clean checkout. 2. Walk the literal demo script: every step a customer / investor would take. 3. Record outcome. Screenshots / video. 4. Any unhandled state → SHIP-BLOCK. File issue; fix; redo walk. This is operator-driven. Agents cannot self-verify a cold walk because they cannot click. They prepare the artefact and the script; the operator runs it. ### Investor / customer percent claims The percent-ready number you state externally comes from the cold walk, not from CI. CI-derived percentages are a sanity check; they are not the truth. Reading too much into "tests pass" produces overclaims that hurt at demo time. ### Versioning Semver per package. Conventions: - **patch** (`x.y.Z+1`) — internal fixes, no consumer-visible API change. - **minor** (`x.Y+1.0`) — additive changes; new method, new field with default. - **major** (`X+1.0.0`) — breaking change. Requires RFC + migration plan + deprecation window honoured. Aggregate the changesets; verify each PR's classification matches its diff. ### Rollback plan Per release: - **What changed**: list of consumer-visible changes. - **Smoke tests post-release**: which surfaces / flows to verify within N minutes. - **Rollback procedure**: exact commands; expected duration. - **Rollback authority**: who can trigger; under what condition. - **Comms plan**: who tells users; via which channel. The plan exists before the release tag. Writing it post-release is too late. 
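The plan above can be sketched as a typed record plus a completeness check that runs before the release tag. The field names and the `rollbackPlanGaps` helper are illustrative assumptions, not part of the playbook's tooling.

```ts
import { strict as assert } from "node:assert";

// Hypothetical shape mirroring the bullets of the rollback plan.
interface RollbackPlan {
  whatChanged: string[]; // consumer-visible changes
  smokeTests: string[]; // surfaces / flows to verify post-release
  rollbackCommands: string[]; // exact commands, in order
  expectedDurationMinutes: number; // expected duration of the rollback
  rollbackAuthority: string; // who can trigger, under what condition
  commsPlan: string; // who tells users, via which channel
}

const REQUIRED_FIELDS: ReadonlyArray<keyof RollbackPlan> = [
  "whatChanged",
  "smokeTests",
  "rollbackCommands",
  "expectedDurationMinutes",
  "rollbackAuthority",
  "commsPlan",
];

// Returns the names of fields that are missing or empty.
// An empty result means the plan is complete enough to tag.
export function rollbackPlanGaps(plan: Partial<RollbackPlan>): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = plan[field];
    if (Array.isArray(value)) return value.length === 0;
    if (typeof value === "string") return value.trim() === "";
    if (typeof value === "number") return !(value > 0);
    return true; // undefined: the field was never filled in
  });
}

// Usage: a draft that lists changes and a duration but nothing else.
const draft: Partial<RollbackPlan> = {
  whatChanged: ["new export endpoint"],
  rollbackCommands: [],
  expectedDurationMinutes: 15,
};
assert.deepEqual(rollbackPlanGaps(draft), [
  "smokeTests",
  "rollbackCommands",
  "rollbackAuthority",
  "commsPlan",
]);
```

A check of this kind only proves the plan is filled in; whether the commands actually work is still something the operator confirms.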
### Common failure modes - **"Investor-ready 90% because tests pass."** Overclaim; demo crashes. → Cold walk dictates the number. - **Release with no changesets.** Consumers cannot see what changed. → Per-PR changeset; aggregate at release. - **Auto-publish from CI on tag.** Rollback path unclear if it goes wrong. → Manual trigger; explicit on-call. - **Tombstones not applied.** Release ships with retired plans claiming work that is done. → Sweep tombstones at release branch. ### Exit criteria (release tag) You can tag a release when: 1. All checkboxes above ticked. 2. Cold walk complete with no blockers. 3. Rollback plan signed off. 4. Release notes published (or queued for publish at tag). ### See also - [`../../pillars/quality/sanity-pattern.md`](../../pillars/quality/sanity-pattern.md) - [`../../pillars/governance/tombstone-pattern.md`](../../pillars/governance/tombstone-pattern.md) - [`../../prompts/slash-ship.md`](../../prompts/slash-ship.md) — orchestrator slash command for this checklist. ==== https://playbook.agentskit.io/docs/phases/06-operate --- title: 'Phase 06 — Operate' description: 'What ''running it'' looks like after agents have shipped the first release.' --- # Phase 06 — Operate What "running it" looks like after agents have shipped the first release. ## TL;DR (human) Operate phase keeps the system trustworthy and the codebase clean over months / years. Incidents have runbooks. Bug hunts run on cadence. Backups restore-drilled quarterly. Keys rotate. Dependencies updated. Tombstones applied to retired work. The phase never ends; the disciplines compound. ## For agents ### Outputs (ongoing) - [ ] **Incident response playbook** — who pages, where to look, how to communicate. - [ ] **Runbooks** per critical surface (auth, audit, vault, billing, releases). - [ ] **Backup / DR procedure** — full-state backup; encrypted; offsite; restore-drilled quarterly. - [ ] **Key rotation calendar** — what rotates when; who triggers. 
- [ ] **Bug-hunt cadence** — scheduled, scoped hunts producing real defect lists. - [ ] **Cost / spend monitor** — budget alerts; cost-guard policies. - [ ] **Telemetry pipeline** — opt-in; PII-redacted; sampled. - [ ] **Compliance evidence** — DSAR procedure, legal-hold procedure, audit log retention. - [ ] **Tombstone discipline** — retired plans / docs / surfaces marked, not deleted. ### Per pillar — Operate-phase discipline **Architecture** - [ ] Track tech debt against ADRs; supersede ADRs when reality drifts. - [ ] Periodic structural review: are package boundaries still right? - [ ] Quarterly: review file-size baselines; pick top-N largest to shrink. **Security** - [ ] Key rotation per calendar. - [ ] Audit ledger verification job runs daily; alerts on mismatch. - [ ] Threat model walked per release; new surfaces added. - [ ] Vulnerability triage SLA (e.g. critical within 24h, high within 7d). - [ ] Penetration test cadence (annual / pre-major-release). **UI-UX** - [ ] Empty-state honesty audits — surfaces still tell the next step. - [ ] Intl coverage per locale — keys do not drift. - [ ] A11y full sweep before each release (changed screens) + annual full sweep. - [ ] Brand-kit verification on a representative tenant per release. **Quality** - [ ] Mutation testing on stable utility modules. - [ ] Bug-hunt phases — periodic, scoped (per area / per layer); produces real defects. - [ ] Gallery / flake budget — flake rate tracked; flake-fix cadence. - [ ] Sanity audit weekly; pre-release CLEAN required. **Governance** - [ ] Postmortems for any user-facing incident. - [ ] Tombstone retired plans, ADRs, surfaces. - [ ] Release engineering retros after each release. - [ ] Quarterly: review open RFCs; close stale; promote ready. **AI-collaboration** - [ ] Memory grooming — duplicates merged; stale entries deleted. - [ ] Sub-agent recipes updated as lessons land. - [ ] CLAUDE.md / AGENTS.md kept current as the codebase evolves. 
- [ ] Periodic agent-onboarding test: in a fresh session on a fresh checkout, can a new agent be productive within 30 minutes?

### Incident response shape

A useful incident playbook answers, in order:

1. **Detection** — how did we find out? (alerting / customer / agent / internal)
2. **Triage** — severity (SEV-1..4); who's on call; who's IC.
3. **Containment** — short-term action to limit impact.
4. **Diagnosis** — root cause investigation.
5. **Remediation** — the fix; tests added to prevent recurrence.
6. **Postmortem** — within 5 business days; blameless; produces ADR updates / new gates.

Postmortems are tombstoned, not deleted. They form the historical record of failure modes, and agents that read them learn which failures to guard against.

### Bug-hunt cadence

A bug hunt is a **scheduled phase** with a defined scope:

- One area (e.g. "audit ledger writes", "OAuth callback flow").
- One layer (e.g. "boundary validation across all methods").
- Time-boxed (1 week typically).
- Produces a defect list with severity + reproducer.

Bug hunts are not "more testing"; they are *adversarial* — hunting for real defects, not raising coverage numbers.

### Key rotation calendar

| Key | Cadence | Audit |
|---|---|---|
| Vault sealer (KEK) | Quarterly | Ledger entry per rotation |
| Audit ledger signer | Quarterly | Verifiable from public keys retained |
| Session signing key | Quarterly | Sessions migrate at rotation |
| Per-secret rotation (high-value) | Quarterly | Vault audit |
| Per-secret rotation (low-value) | Annually | Vault audit |
| Connector OAuth tokens | Per-token expiry; refresh flow | Per-refresh log |
| Compromise rotation | Immediate | Ledger + comms |

Calendar entries with owners — not just "we should rotate".

### Tombstone discipline (Operate-specific)

In Operate, tombstones accumulate. Periodic archival:

1. Tombstoned files older than N months → move into `_archive/`.
2. Update back-references.
3. Index files list archived items in a collapsed section.

Never delete the body.
The historical record is part of the asset. ### Cost / spend monitoring - Budget per workspace / account / tenant. - Alert at 50%, 75%, 90%, 100% of budget. - Hard cap at 110% (or per policy). - Anomaly detection: sudden 10× spike → pause + page. - Audit-log every spend-policy change. ### Investor / due-diligence readiness For repos sold or audited: - Decision log (`docs/adr/`) is the architecture artefact. - Threat model (`docs/security/threat-model.md`) is the security artefact. - Audit ledger is the compliance artefact. - Tombstoned plans + postmortems are the maturity artefact (proves you remember). - Sanity report archive is the quality artefact. A new reviewer should be able to read those five and form a complete picture of the system within an hour. ### Exit criteria Operate has no exit; the project is in Operate for its entire lifetime. Each cycle (typically per release) checks: 1. Incident response was clean (no SEV-1 unhandled). 2. Backup restore drill succeeded this quarter. 3. Key rotations on calendar happened. 4. Bug hunt for this cycle ran. 5. Tombstones applied; archive groomed. ### See also - [`../../pillars/security/audit-ledger-pattern.md`](../../pillars/security/audit-ledger-pattern.md) - [`../../pillars/security/vault-pattern.md`](../../pillars/security/vault-pattern.md) — rotation. - [`../../pillars/quality/sanity-pattern.md`](../../pillars/quality/sanity-pattern.md) - [`../../pillars/governance/tombstone-pattern.md`](../../pillars/governance/tombstone-pattern.md) - [`../../pillars/ai-collaboration/memory-pattern.md`](../../pillars/ai-collaboration/memory-pattern.md) — grooming. ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration --- title: 'Pillar — AI Collaboration' description: 'How to make an agent productive in your repo on day one and durably good across sessions.' --- # Pillar — AI Collaboration How to make an agent productive in your repo on day one and durably good across sessions. ## Status ◐ Scoped, not yet detailed. 
This is the most distinctive pillar of the playbook — it captures lessons that have no analogue in pre-agent software development. ## Scope | Concern | Universal principle | Concrete pattern | |---|---|---| | Bootstrap doc | One file an agent reads first, every session | `CLAUDE.md` (or `AGENTS.md`) at the repo root with non-negotiables + routing | | Routing table | Map "I want to change X" → "edit path Y" | `AGENTS.md` table; agents triage faster by group than by file | | Persistent memory | Lessons survive session ends | `MEMORY.md` (index) + `memory/*.md` (one fact per file) pattern | | Goal mode | Agent works toward a condition, not a turn count | Stop hook with goal condition; clears when the condition holds | | Sub-agents | Long fan-outs delegate to scoped specialist agents | Sub-agent recipes per task class (search, plan, review, implement) | | Slash commands | Repeated workflows become palette entries | `/goal`, `/loop`, `/review`, `/clear`, plus project-specific | | System prompts | Per-role prompts (architect, reviewer, fixer) | Reusable role files; injected per task | | Verify-first | Before acting, confirm the state is what you think it is | Default `gh issue view`, `git fetch`, `pwd` at session start | | Single sub-unit | One discrete shippable change per session | Defined up front; no scope creep | | Honest reporting | Faithful state, not optimistic state | "Tests failed: \", not "Tests pass after I fix the unrelated thing" | | Duplication detection | Verify against real exports, not doc names | `npm pack` + read `.d.ts`, never trust naming similarity | | Concurrent-merge survival | Multiple agents pushing to main | Stash-verify red, rebase clean, retry; pre-push hook covers structural drift | ## Non-negotiables 1. **CLAUDE.md / AGENTS.md is mandatory.** No agent starts work without one. 2. **Persistent memory grows from lessons, not from chat.** Each memory file is a fact + how to apply it. 3. 
**Verify-first.** State at session start may not match state at PR open. 4. **Honest reporting.** Tests that failed are reported failed. Steps skipped are reported skipped. 5. **One sub-unit per session.** Quality over speed. ## See also - [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md) — bootstrap doc skeleton. - [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md) — routing table skeleton. - [`../../templates/MEMORY.md.template.md`](../../templates/MEMORY.md.template.md) — persistent memory skeleton. - [`../../prompts/`](../../prompts/) — system prompts + sub-agent recipes. ## Roadmap - `universal.md` - `bootstrap-doc-pattern.md` - `memory-pattern.md` - `sub-agent-pattern.md` - `slash-commands-pattern.md` - `concurrent-agent-pattern.md` ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration/bootstrap-doc-pattern --- title: 'Bootstrap Doc Pattern' description: 'The file an agent reads first, every session. Two files together: one for non-negotiables, one for routing.' --- # Bootstrap Doc Pattern The file an agent reads first, every session. Two files together: one for non-negotiables, one for routing. ## TL;DR (human) `CLAUDE.md` (or `AGENTS.md` per your toolchain) at the repo root, loaded automatically. It carries the eight non-negotiables, the build commands, and a pointer to a separate `AGENTS.md` (routing table). Under 200 lines. Updated only when the rules change. ## For agents ### Why two files `CLAUDE.md`: non-negotiables. Stable. Mirror of the rules in [`../../README.md`](../../README.md). Updated rarely. `AGENTS.md` (or equivalent): routing. Volatile. Updated when packages get added, renamed, merged. Lists every package + which surface it owns. Why separate: the routing changes far more often than the rules. Keeping them in one file forces a rule re-read every time a package is renamed, which is wasteful. 
Keeping them separate lets the rules cache in the agent's working memory across sessions while routing stays current. ### `CLAUDE.md` shape Six sections, in order: 1. **Title + one-paragraph repo at a glance.** Stack, package count, app count, top-level layout. 2. **Pointer to the canonical doc.** "`AGENTS.md` is the routing table — read it first when you don't know which package to touch." 3. **Non-negotiables.** The eight-rule kernel (see [`../../README.md`](../../README.md)) trimmed to what applies to this codebase, numbered. 4. **Before you ship.** The exact commands (lint, test, gate). Per-package versus whole-repo distinction. 5. **Where to look next.** Five-row table mapping intent to doc path. 6. **When a doc contradicts the code.** "The code wins. Update or remove the doc." Template: [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md). ### `AGENTS.md` shape 1. **TL;DR philosophy** — 3–5 numbered statements. Why this codebase looks the way it does. 2. **Mental map** — 4–7 logical groups, each with the packages in that group and the concern they own. Agents triage faster by group than by alphabetical name. 3. **Routing table** — two-column: "I want to change…" → "Edit…". 4. **Workflow** — verify-first; one sub-unit; intent manifest; self-review. 5. **When something is unclear** — escalation path (read for-agents doc → ADR → code → open `discuss:` issue). Template: [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md). ### Length discipline - `CLAUDE.md`: ≤ 200 lines (the agent reads it whole every session). - `AGENTS.md`: ≤ 400 lines; if it grows past that, split per-package detail into `docs/for-agents/packages/\.md` and link. If either file is over budget, agents skim instead of read. Skim defeats the purpose. ### Versioning These files are part of the code. They go through PR review. Changes to the non-negotiables require an ADR (the rule change IS an architecture decision). 
Routing-table changes do not require an ADR — they reflect the codebase, which itself went through ADRs. ### Gate Recommended automated checks: 1. **Existence** — `CLAUDE.md` (or `AGENTS.md`) must exist at the repo root. CI fails otherwise. 2. **Size budget** — `CLAUDE.md` ≤ 200 lines. 3. **Routing currency** — every package in the workspace appears at least once in `AGENTS.md`'s mental map or routing table. Stale entries (referring to deleted packages) fail. Reference impl: a `check-agent-docs.example.mjs` in [`../../scripts/`](../../scripts/) (ship in a future session). ### Common failure modes - **One mega-file.** Non-negotiables + routing + per-package detail all in one place; 1500 lines; agents read the first 300 and miss the rest. → Split into the two-file pattern. - **Routing table out of date.** Lists a package that was renamed 3 weeks ago. Agents follow it, fail. → Make routing currency a gate. - **Non-negotiables marketed as "guidelines".** Hedged language ("try to", "prefer"). Agents treat hedged rules as optional. → Imperative voice. "No `any`." not "Avoid `any` when possible." - **`CLAUDE.md` references files that no longer exist.** → Gate that resolves every relative link. ### See also - [`memory-pattern.md`](./memory-pattern.md) — `MEMORY.md` index loads alongside `CLAUDE.md`. - [`../architecture/universal.md`](../architecture/universal.md) — the non-negotiables come from here. - [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md), [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md). ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration/concurrent-agent-pattern --- title: 'Concurrent Agent Pattern' description: 'How to survive — and benefit from — multiple agents working in the same repo at the same time.' --- # Concurrent Agent Pattern How to survive — and benefit from — multiple agents working in the same repo at the same time. ## TL;DR (human) Your branch is not the only branch. 
Issues you plan to fix may be fixed concurrently. Main moves under your feet. Defensive practices: verify-first at session start, rebase not merge, stash-verify red before "fixing" CI, never `--theirs`/`--ours` without justification. ## For agents ### The hazards When N agents work in one repo, the rate of one specific class of waste goes up linearly: | Hazard | Symptom | |---|---| | Duplicate work | Two agents fix the same issue in parallel; one PR gets closed as duplicate | | Stale issue | Agent grinds on an issue that was closed by a peer mid-session | | Conflict storm | Agent rebases and finds the file they were editing was deleted | | Silent revert | Agent uses `--theirs` to resolve a conflict; peer's work is dropped | | False red | Agent "fixes" a CI failure that was actually pre-existing on main | | Doc drift | Agent updates a doc that another agent just rewrote | ### Defensive checklist (session start) Run this every session, not just the first session in a sprint: 1. `git fetch origin --prune` 2. `git status` — clean tree expected; if dirty, stash with a tag. 3. `git log origin/main..HEAD` and `git log HEAD..origin/main` — see what diverged. 4. `gh issue view \ --json state` for every issue you plan to touch. 5. `gh pr list --search "is:open touch-path:\"` — peer PRs touching your files. 6. If you find peer activity in your path: read those PRs before starting; you may be redundant. This costs ~30 seconds. It prevents hours of wasted work. ### Defensive checklist (before push) 1. `git fetch origin` 2. Rebase onto fresh `origin/main`. Resolve conflicts (see "Conflict policy" below). 3. Re-run quality gates on the rebased branch. 4. `gh issue view \ --json state` again — the issue may have closed while you worked. 5. If the issue closed: do **not** push a dup PR. Comment on the closing PR if your work has additional value; otherwise drop the branch. ### Conflict policy Conflicts surface where your work meets peer work. Two failure modes: 1. 
**`git checkout --theirs` / `--ours`.** Drops one side's work entirely. Almost never the right answer. → If you must use this, the PR-intent manifest must include `merge-override: \`. The reviewer verifies the override is justified. 2. **Hand-merged conflict that quietly mis-orders things.** Diff looks clean; behavior is broken. → After any conflict resolution, re-run the affected tests. Always. Default: **rebase, resolve hunk-by-hunk, keep both sides where they coexist.** Merges should sum. ### Stash-verify-red Before "fixing" a CI failure: 1. Stash your changes. 2. Check out fresh `origin/main`. 3. Re-run the failing job. 4. If it fails on clean `origin/main`: the failure is pre-existing. **Do not blame your branch.** File an issue; either pick up the fix yourself in a separate PR, or revert your stash and continue on your sub-unit. 5. If it passes on clean `origin/main`: the failure is yours. Apply the stash and fix. This prevents agents grinding on imaginary failures and prevents peer pressure to "make the red go away" without diagnosing. ### Worktrees for parallel work If you run multiple sessions or multiple agents on one machine: - One worktree per branch (`git worktree add ../\-\ \`). - Each worktree is fully independent — separate `node_modules`, separate `dist`, separate everything that lives in `.gitignore`. - Never edit the same file in two worktrees. The second edit will conflict at push. Worktrees minimize "stash-restore" thrash. A cheap defensive practice. ### Verify-first before close-out Before closing an issue you "fixed": 1. `gh issue view \ --json state` — confirm still open. 2. Read the issue body again — confirm the DoD is what you fixed. 3. Cross-check your PR against the DoD line-by-line. 4. Look at peer-closed PRs that reference the issue — maybe they already closed it and your work is redundant. 
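A minimal sketch of the close-out checks as a pure decision function. It assumes the issue state arrives as the JSON printed by `gh issue view --json state`, taken here to be an object whose `state` field is `OPEN` or `CLOSED`; the type names, the `decideCloseOut` helper, and the DoD bookkeeping are hypothetical.

```ts
import { strict as assert } from "node:assert";

type CloseOutDecision =
  | "close-out" // issue open, every DoD item covered by the PR
  | "finish-dod" // issue open, DoD items still unmet
  | "comment-on-closing-pr" // issue closed by a peer, our work adds value
  | "drop-branch"; // issue closed by a peer, our work is redundant

interface CloseOutInput {
  issueStateJson: string; // raw stdout, assumed shape: {"state":"OPEN"}
  dodItems: string[]; // DoD lines copied from the issue body
  coveredByPr: string[]; // DoD lines the PR demonstrably satisfies
  hasAdditionalValue: boolean; // judged after reading the peer's PR
}

export function decideCloseOut(input: CloseOutInput): CloseOutDecision {
  const parsed = JSON.parse(input.issueStateJson) as { state?: string };
  // Anything other than an explicit OPEN is treated as closed (the conservative choice).
  const isOpen = parsed.state?.toUpperCase() === "OPEN";
  if (!isOpen) {
    return input.hasAdditionalValue ? "comment-on-closing-pr" : "drop-branch";
  }
  const covered = new Set(input.coveredByPr);
  const unmet = input.dodItems.filter((item) => !covered.has(item));
  return unmet.length > 0 ? "finish-dod" : "close-out";
}

// Usage: the issue closed mid-session and the peer PR already covers the work.
assert.equal(
  decideCloseOut({
    issueStateJson: '{"state":"CLOSED"}',
    dodItems: ["handles empty input"],
    coveredByPr: ["handles empty input"],
    hasAdditionalValue: false,
  }),
  "drop-branch",
);
```

Keeping the decision pure means the `gh` call and the human judgment about additional value stay outside the function, where they can be verified separately.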
This was the single highest-yield discipline in production: catching dup work *after* an agent finished implementing it (because the issue closed mid-session) is wasteful but recoverable; catching it before the PR is open saves the entire impl cost. ### Memory updates from concurrent-agent events When concurrent work surprises you, write a memory: - The path that had peer activity (so next session you check it). - The issue you found closed (so next session you do not pick it up). - The conflict resolution pattern that worked (so next session you reuse it). See [`memory-pattern.md`](./memory-pattern.md). ### Common failure modes - **Agent picks high-contention issue from a popular epic.** Maximises duplicate-work probability. → Prefer low-parallelism issues; avoid hot epics unless explicitly assigned. - **Agent assumes main is stable.** Pushes; CI red because peer landed a refactor 10 min ago. → Always `git fetch` before push. - **Agent uses `--theirs` to "win" a conflict.** Drops the other agent's work silently. → Lint / gate the PR-intent manifest to require `merge-override:` annotation when these flags appear in the diff. - **Agent grinds on a pre-existing red.** Wastes hours "fixing" a failure that was never theirs. → Stash-verify-red protocol. ### See also - [`../governance/README.md`](../governance/README.md) — merge rules + PR-intent removes-list. - [`universal.md`](./universal.md) — Rule 9. - [`memory-pattern.md`](./memory-pattern.md) — log what surprised you. ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration/memory-pattern --- title: 'Memory Pattern' description: 'How to make agents durably learn from prior sessions without polluting context with chat transcripts.' --- # Memory Pattern How to make agents durably learn from prior sessions without polluting context with chat transcripts. ## TL;DR (human) One fact per file. Frontmatter typed (`user`, `feedback`, `project`, `reference`). 
Index file (`MEMORY.md`) loads every session — short, one-line-per-memory. Memories link to each other with `[[name]]`. Lessons land the moment they happen, not at session end.

## For agents

### File layout

```
.agent-memory/            (or wherever your toolchain looks)
├── MEMORY.md             # the only file loaded every session
├── user_.md              # who the user is
├── feedback_.md          # how the user wants the agent to work
├── project_.md           # ongoing work / constraints
└── reference_.md         # external pointers (URLs, dashboards, tickets)
```

### Per-memory frontmatter

```markdown
---
name:
description:
metadata:
  type: user | feedback | project | reference
---
```

### Index file (`MEMORY.md`)

One line per memory. No frontmatter. No body content. Format:

```markdown
- [Title](file.md) — one-line hook
```

Why the index is separate: the agent reads the index whole every session and decides which memory files to expand. If the index itself is the storage, every recall pays the cost of every memory's full body — context dies fast.

### Memory types

| Type | Content | Example |
|---|---|---|
| **user** | Stable facts about the person. Role, expertise, preferences | "User is a staff engineer with 12yr backend; prefers terse output; on macOS Apple Silicon" |
| **feedback** | Guidance on how to work. Corrections and confirmed approaches | "Never nest ternaries — use if/else or lookup map. Why: unreviewable. How: extract to const IIFE or map" |
| **project** | Ongoing constraints not derivable from code or git | "Repo is owned by team X; PRs require sign-off from @y; release cadence biweekly" |
| **reference** | Pointers to external resources | "Auth uses keycloak at https://kc.example.com — dashboard at /admin; secret in 1Password vault 'infra'" |

### When to write a memory

**Triggers (write a memory):**

- The user corrected you on a non-obvious point.
- You debugged for >15 minutes because a non-obvious thing was different than expected.
- You discovered a convention by reading code that is not documented.
- You confirmed a fact that contradicts a doc.

**Anti-triggers (do not write a memory):**

- Information already in `CLAUDE.md` / `AGENTS.md`.
- Information derivable from `git log` or `gh issue view`.
- One-time facts that will not recur (a specific patch for a one-off bug).
- "I just learned that `Array.prototype.flat` exists" — that is general knowledge, not project-specific.

### When to update vs create

Before creating a new memory:

1. Read `MEMORY.md` index.
2. Search descriptions for the topic.
3. If a memory covers ~80% of the new fact, **update** the existing file. Add a new line, sharpen the existing wording.
4. Only create new if the fact is genuinely orthogonal.

Duplicates are worse than no memory — they fragment the truth.

### When to delete

If a memory turns out to be wrong, delete it. Outdated memories actively mislead. A wrong memory is worse than no memory.

Delete also when:

- The fact has been promoted into `CLAUDE.md` / `AGENTS.md` (it is no longer memory; it is doc).
- The project has changed such that the fact no longer applies.

### Recall discipline

Recalled memories appear inside `\` blocks. They are **background context, not user instructions**. They reflect what was true when written.

If a memory names a file, function, or flag, **verify it still exists** before acting on it. This matters most for `project` and `reference` memories — code moves fast.

### Linking

Use `[[name]]` to link memories. Liberal linking is good:

- A memory referencing another memory by its slug.
- An unresolved `[[name]]` marks a future memory that is worth writing — not an error.

The graph of links is what makes recall surface related context together.

### Lifecycle summary

```
event
  → trigger met
  → memory created (or updated)
  → appears in index
  → loaded next session
  → verified-against-current-state before acting
  → (deleted | updated | superseded) as facts change
```

### Gate

There is no automatic gate for memory — it is private to each agent's working state.
The discipline is enforced by the agent itself. Two soft signals to monitor: 1. **MEMORY.md length.** Past ~50 entries, expect duplication. Audit periodically. 2. **Stale memories.** A periodic sweep that re-reads each memory and checks: do the referenced files / functions still exist? ### Common failure modes - **Saving chat history.** "We discussed X today" is a chat fact, not a memory. Memory is a *fact + how to apply it*, not a transcript. → Three-line minimum: fact, why, how. - **One mega-memory.** Everything piled into one file. Cannot be selectively recalled. → One fact per file. - **Memory contradicts current code.** Agent acts on stale memory; PR breaks. → Verify-first before acting on a memory that names a path. - **Memory bloat.** 200+ memories; recall index is itself 6 KB. → Audit; merge duplicates; delete stale. - **No "why".** Memory says "do X" without why. Agent follows it once, but cannot reapply when the situation differs. → Always include the rationale. ### See also - [`bootstrap-doc-pattern.md`](./bootstrap-doc-pattern.md) — `MEMORY.md` loads alongside the bootstrap doc. - [`../../templates/MEMORY.md.template.md`](../../templates/MEMORY.md.template.md) — copy-paste skeleton. - [`universal.md`](./universal.md) — Rule 10 (lessons land the moment they happen). ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration/slash-commands-pattern --- title: 'Slash Commands Pattern' description: 'How to turn repeated workflows into palette-invoked commands so they run identically every time.' --- # Slash Commands Pattern How to turn repeated workflows into palette-invoked commands so they run identically every time. ## TL;DR (human) A slash command is a named, palette-invoked prompt template. One command = one well-scoped workflow. The body is a prompt, not a script. Side-effects (push, merge, deploy) require explicit confirmation in the body. 
## For agents ### Anatomy A slash command has: - **Trigger** — `/\` typed by the user (or auto-invoked by another agent / hook). - **Args** — optional positional or named arguments parsed from the trigger line. - **Body** — a prompt template that instructs the agent. - **Tools** — the set the command is allowed to use. - **Confirmation** — for side-effecting commands, an explicit "are you sure?" or annotation requirement. ### When to make a slash command Make one when **all** of the following hold: - The workflow runs ≥3 times per week. - The workflow has 5+ steps that benefit from being stated once. - The prompt body is the same each time (the args parameterise the variable part). - A human would rather type `/\ \` than re-type the prompt. If any condition fails, do not make a slash command — make a script or a snippet instead. ### Canonical commands | Command | Purpose | Args | |---|---|---| | `/goal \` | Set a session goal + stop hook | the success condition | | `/loop [\] \` | Recurring or self-paced runs | interval (optional), command body | | `/review [\]` | Multi-agent PR review | PR number; defaults to current branch | | `/clear` | Reset session context cleanly | none | | `/plan \` | Spawn `plan` sub-agent on the task | task description | | `/ship` | Run the release-gate checklist | none | | `/sanity` | Run cross-cutting audit, surface drift | none | Project-specific commands go on top. Examples worth defining: - `/issue-from-bug "\"` — files a structured bug report from a one-line description. - `/promote-rfc \` — promotes an accepted RFC to an ADR. - `/tombstone \` — adds a tombstone block to a retired doc. ### Body discipline The body is a prompt. It should: 1. **State the goal in imperative voice.** "Open a PR with…" not "I would like to open…". 2. **List the steps explicitly.** Number them. Each step is one action. 3. **Name verification points.** "Verify gates green before merging" — explicit, not implied. 4. 
**Require confirmation for side-effects.** "Confirm with user before pushing."
5. **State the exit condition.** "Done when the PR is merged and the branch is deleted."

Anti-pattern: a body that says "do the right thing". The slash command exists *because* "the right thing" was being done inconsistently.

### Side-effect confirmation

Commands that mutate outside the local checkout — push, merge, deploy, file an issue on someone else's behalf, send a Slack message — require **explicit confirmation in the body**:

```
Before running `gh pr merge --admin`, summarize the diff to the user and wait
for an explicit "yes, merge" reply. Do not merge based on prior approval in
a different context.
```

Reason: approval in one context does not extend to the next. A user who said "merge it" 30 minutes ago for a different PR is not approving this one.

### Versioning

Slash command bodies are part of the repo (commit them; do not rely on per-user toolchain dotfiles for shared workflows). Changing a slash command body is a PR. Reviewers verify the change.

### Common failure modes

- **Slash command does too much.** `/ship` that bumps version, opens PR, merges, deploys, posts to Slack. One step fails; partial state. → One command = one well-scoped workflow.
- **Body relies on implicit context.** "Use the standard PR template" — which one? → Inline the template or link explicitly.
- **No confirmation on side-effect.** `/deploy` merges and pushes to prod without asking. → Confirmation in body.
- **Args parsed loosely.** `/loop` with ambiguous interval; agent defaults to "every minute" and burns budget. → Strict arg parsing; default to safe (long interval).
- **Slash command not shared.** Each agent has a different version in their dotfiles. → Commit shared commands to the repo.

### Loop discipline

`/loop` deserves special care.
It is the command that most often goes wrong: - **Pick the right interval.** Burning the prompt cache every 30s costs more than every 20 minutes; the prompt cache TTL is short. - **Default to long fallbacks.** If the loop is waiting on external state, the wake-up should be tied to that state, not a fixed timer. - **Have a clear exit.** A loop with no exit will run until budget runs out. ### See also - [`sub-agent-pattern.md`](./sub-agent-pattern.md) — many slash commands delegate to a sub-agent. - [`../../prompts/README.md`](../../prompts/README.md) — command bodies live here. - [`universal.md`](./universal.md) — Rule 7 (explicit goal, explicit exit). ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration/sub-agent-pattern --- title: 'Sub-agent Pattern' description: 'How to delegate scoped tasks to specialist agents so the orchestrator stays focused.' --- # Sub-agent Pattern How to delegate scoped tasks to specialist agents so the orchestrator stays focused. ## TL;DR (human) Sub-agents are scoped specialists. Each gets a narrow task, a narrow toolset, and returns a summary, not a file dump. Use them for searches, plans, reviews, and parallel implementations. Tier the model size by task complexity. ## For agents ### When to delegate Delegate when **any** of the following is true: - The task means reading across many files and you only need the conclusion. - The task is independent of what you are doing now (can run in parallel). - The task fits a recurring shape (search, plan, review, implement) you already have a recipe for. - Your context is filling and the task does not need your conversation history. Do **not** delegate when: - You already know the answer (a single grep). - The task requires your conversation history (sub-agents do not see it). - The task is destructive or hard to reverse (you want it on your own conscience). ### Recipe shape A sub-agent recipe specifies: 1. **Role.** One sentence: what kind of agent this is. 2. 
**Tools.** The minimum tool set. Less is more — narrow tools force focus. 3. **Inputs.** What the orchestrator must pass. 4. **Outputs.** The exact shape of the summary returned to the orchestrator. 5. **Stop condition.** When the sub-agent is done. ### Canonical recipes | Recipe | Role | Tools | Stop when | |---|---|---|---| | `explore` | Read-only search across files | read, grep, glob, ls | Found the file / symbol asked for, returns excerpts + paths | | `plan` | Step-by-step implementation plan | read, web fetch | Plan written; orchestrator owns execution | | `code-explorer` | Trace execution paths, map dependencies | read, grep, glob | Diagram + dependency list returned | | `code-reviewer` | Confidence-filtered review pass | read, git diff | Issues returned with confidence scores; orchestrator decides which to fix | | `implementer` | Build a sub-unit against a finalised plan | read, edit, write, bash | PR-ready diff, tests green | | `security-reviewer` | Security review of pending changes | read, git diff | Findings + severity list | ### Model tiering Smaller models on smaller tasks. Reserve the largest model for what needs deep reasoning. | Task complexity | Model tier | Examples | |---|---|---| | Trivial | small (haiku-class) | Find a file by name; grep for a symbol; list a directory | | Simple | medium (sonnet-class) | Write documentation; write unit tests; review code | | Complex | large (opus-class) | Architect a feature; design a contract; resolve a tricky merge | Mis-tiering hurts both directions. Putting a small model on architecture wastes time. Putting a large model on a `grep` wastes money. ### Outputs are summaries A sub-agent returns a **summary to the orchestrator**, not a file dump. The orchestrator pastes the summary to the user / next agent — the sub-agent's full transcript is invisible. This means: - The sub-agent's last message must be self-contained. - It cites file paths + line numbers in clickable form (`path:line`). 
- It does not include long excerpts unless asked. - It explicitly says "done" or "blocked on X" — no ambiguity. ### Parallelism Independent sub-agents run in parallel. The orchestrator launches all of them in one batch, waits for results, then proceeds. Rule: if N tasks share no data dependency, launch all N at once. Sequential launch wastes wall-clock time. ### Continuation vs new spawn Two ways to talk to a sub-agent again: - **Continue** the existing one (your toolchain has a "send message to agent \") — preserves its context. - **Spawn a new one** — fresh context, no recall. Continue when the task is an extension of the prior one. Spawn fresh when the task is unrelated; carrying old context bloats the new task. ### Common failure modes - **Sub-agent that needed orchestrator context.** "Implementer" launched without the plan; produces something off-spec. → Pass the plan as input. - **Orchestrator re-runs a search the sub-agent already did.** Wastes time. → Trust the sub-agent's summary; ask follow-ups if needed. - **Sub-agent given every tool "just in case".** Wanders. → Narrow toolset. - **Mis-tiered model.** Opus on a grep; haiku on architecture. → Tier by task class, not by "best available". - **Sub-agent transcript leaked to user as primary output.** User now sees raw exploration noise. → Orchestrator distills the summary; transcript is internal. ### See also - [`../../prompts/README.md`](../../prompts/README.md) — recipe index (bodies in a future session). - [`slash-commands-pattern.md`](./slash-commands-pattern.md) — slash commands often wrap a sub-agent. - [`universal.md`](./universal.md) — Rule 8. ==== https://playbook.agentskit.io/docs/pillars/ai-collaboration/universal --- title: 'AI Collaboration — Universal Principles' description: 'How to get production-grade work out of an AI coding agent, durably, across sessions and across multiple agents working in parallel.' 
--- # AI Collaboration — Universal Principles How to get production-grade work out of an AI coding agent, durably, across sessions and across multiple agents working in parallel. ## TL;DR (human) Ten rules. They are stack-agnostic, model-agnostic (Claude / GPT / Gemini / open-weights), and tool-agnostic (Cursor / Copilot / Claude Code / Aider / Roo / your CLI). Adopt all ten or expect specific failure modes to repeat. 1. Bootstrap doc at the repo root, loaded every session. 2. Routing table that maps intent to file path. 3. Persistent memory as one-fact-per-file, not chat history. 4. Verify-first before any action. 5. One sub-unit per session. 6. Honest reporting — failures stated, not glossed. 7. Explicit goal, explicit exit condition. 8. Delegate fan-outs to scoped sub-agents. 9. Concurrent-agent awareness — your branch is not the only branch. 10. Lessons land in memory the moment they happen. ## For agents ### Rule 1 — Bootstrap doc at the repo root A file named `CLAUDE.md`, `AGENTS.md`, or `.cursorrules` (per your toolchain) must exist at the repo root. It is the first thing the agent loads. It contains: - the non-negotiables (the irreducible rules — see [`../../README.md`](../../README.md) for the eight-rule kernel), - a one-paragraph "repo at a glance", - a pointer to the routing table, - the build / test / gate commands. Keep it under 200 lines. Agents read the whole thing every session; long files dilute attention. Template: [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md). **Failure mode prevented:** agents reinventing rules each session because the rules were "in the chat" of a previous session, which the agent does not see. ### Rule 2 — Routing table A second file (`AGENTS.md` — separate from the non-negotiables doc) is a routing table. Two-column: "I want to change X" → "edit path Y". Rows are not for every file — they are for **every place an agent might plausibly land if they got it wrong**. 
If two rows could plausibly apply to the same change, the boundary is wrong. Fix the boundary or merge the rows.

Template: [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md).

**Failure mode prevented:** agents creating sibling packages because they did not know where the right one was; agents piling code into the largest file because no rule said where it went.

### Rule 3 — Persistent memory as one-fact-per-file

Memory is what survives between sessions. It is not chat history; it is curated.

- One memory = one file.
- Files have frontmatter (`name`, `description`, `type`).
- Types: `user`, `feedback`, `project`, `reference`.
- An index file (`MEMORY.md`) lists them, one line each; the index is what loads every session.
- Memories link to each other with `[[name]]`.

When a non-obvious lesson lands, write a memory **immediately** — not at session end. Session-end is too late; you have already forgotten the precise context.

Template: [`../../templates/MEMORY.md.template.md`](../../templates/MEMORY.md.template.md).

**Failure mode prevented:** agents repeating an already-fixed mistake across sessions because the lesson lived only in the prior conversation.

### Rule 4 — Verify-first

Before any action, confirm that the state is what you think it is.

- Before opening an issue: search for an existing one.
- Before fixing an issue: confirm it is still open (`gh issue view \ --json state`). Another agent may have closed it.
- Before pushing: `git fetch` and check that your branch is still up to date.
- Before claiming a file exists: read it. Memories may reference files that have since been renamed.
- Before claiming duplication: pull the upstream `.d.ts` and read the real API. Naming similarity is not duplication.

Verify-first is cheap. The failures it prevents are expensive (duplicate PRs that get rejected; conflicts at merge; stale-state code review).
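The issue-state check in the verify-first list could be wrapped in a small helper. This sketch assumes the JSON printed by `gh issue view --json state` carries a top-level `state` string such as `OPEN` or `CLOSED`; `isStillOpen` is a hypothetical name, and the caller is expected to run the command and pass in its output.

```typescript
// Returns true only when the payload clearly reports an open issue.
// Anything unparseable or unexpected is treated as "not verified",
// so the agent stops and re-checks instead of proceeding on a guess.
function isStillOpen(ghJson: string): boolean {
  try {
    const parsed: unknown = JSON.parse(ghJson);
    if (typeof parsed !== "object" || parsed === null) return false;
    const state = (parsed as { state?: unknown }).state;
    return typeof state === "string" && state.toUpperCase() === "OPEN";
  } catch {
    // Malformed output: fail closed rather than assume the issue is open.
    return false;
  }
}
```

Failing closed is the design choice here: an agent that cannot confirm the state should say so (Rule 6) rather than start work.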
**Failure mode prevented:** agents grinding for hours on issues that were closed concurrently; agents proposing fixes to files that have since been deleted; agents claiming duplication based on doc names instead of exported APIs.

### Rule 5 — One sub-unit per session

A sub-unit is one discrete, shippable change, defined up front. No scope creep mid-session.

- If you discover a second issue while working, file it. Do not fix it now.
- If the work is bigger than a session, split it into phases, ship phase 1, continue in a fresh session.
- Quality over speed. A session that ships one clean sub-unit beats a session that ships three half-done ones.

**Failure mode prevented:** large PRs that combine unrelated changes; reviewers unable to verify intent; later-session agents reverting one part of the work because they saw only the part related to their own task.

### Rule 6 — Honest reporting

When the agent reports state at the end of a turn, the report must match reality.

- "Tests passed" only if all tests in scope passed. If 12 of 13 passed, say so, and quote the failure.
- "Quality gates green" only if gates ran and exited 0.
- "Step was skipped" rather than burying it.
- "I could not verify X" rather than asserting X.

Production agents that report optimistically erode trust fastest. Once the reviewer cannot believe the agent's report, every PR needs full re-verification — which defeats the productivity gain.

**Failure mode prevented:** silent regressions; PRs landing red because the agent claimed green; reviewer fatigue forcing manual verification of every claim.

### Rule 7 — Explicit goal, explicit exit condition

A session has a goal. The goal is stated, not implied. The agent works toward the goal until an exit condition holds.

- "Add login flow" is not a goal. "Add OAuth login with Google + GitHub providers, tested against the mock IdP, behind a feature flag" is a goal.
- "Until the user says stop" is not an exit condition. "When the test passes and the PR is open" is an exit condition.
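One way to make an exit condition explicit is to write it as a predicate over observable state. The sketch below is illustrative only: `SessionState`, `ExitCondition`, and the step budget are hypothetical names introduced here, not an API of any particular toolchain's goal mode or stop hooks.

```typescript
// Hypothetical snapshot of what the agent can actually observe.
interface SessionState {
  testsPassing: boolean;
  prOpen: boolean;
}

// An exit condition is a predicate over observable state.
type ExitCondition = (s: SessionState) => boolean;

// Encodes "When the test passes and the PR is open".
const testPassesAndPrOpen: ExitCondition = (s) => s.testsPassing && s.prOpen;

// Stop when the condition holds, or when an assumed step budget is spent.
function shouldStop(s: SessionState, done: ExitCondition, stepsLeft: number): boolean {
  return done(s) || stepsLeft <= 0;
}
```

"Until the user says stop" cannot be written as such a predicate, which is a quick test for whether a proposed exit condition is explicit enough.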
Toolchains expose this as "goal mode" or "stop hooks". Use them. The agent's heuristic to stop is unreliable; the explicit condition is reliable. **Failure mode prevented:** agents stopping mid-task because the conversation turn felt like a stopping point; agents over-iterating on a task that was actually done. ### Rule 8 — Delegate fan-outs to scoped sub-agents A sub-agent is a scoped specialist: it gets a narrow task, a narrow toolset, and its result is returned as a summary. Sub-agent types worth defining: | Type | Tools | Use when | |---|---|---| | `explore` | read + grep + glob | Searching across many files; you only need the conclusion | | `plan` | read + web fetch | Designing a step-by-step approach before implementing | | `code-reviewer` | read + git diff | Confidence-filtered review pass | | `implementer` | read + edit + bash | Building a sub-unit against a finalised plan | Tier the model by task complexity: light tools / search → haiku-tier; documentation, unit tests, code review → sonnet-tier; complex reasoning → opus-tier. Reserve the largest model for what truly needs it. **Failure mode prevented:** one agent context bloating with file dumps from a search; one agent context losing focus by interleaving planning with implementation. ### Rule 9 — Concurrent-agent awareness Your branch is not the only branch. Another agent may be: - editing the same file in a parallel worktree, - closing the issue you are about to fix, - merging a PR that conflicts with yours, - pushing to main while you rebase. Defensive practices: - `git fetch` at session start. - `gh pr list --search "is:open \"` to detect parallel work touching the same files. - Rebase, don't merge, when integrating main into a feature branch. - Stash + verify red on a clean `origin/main` before "fixing" a CI failure — the failure may be pre-existing, not your fault. 
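The parallel-work check in the defensive practices can be reduced to a set intersection once the inputs are gathered. This is a sketch under assumptions: `OpenPr` and `overlappingPrs` are hypothetical names, and the caller is assumed to have already collected its own changed paths (for example from `git diff --name-only`) and the file lists of open PRs.

```typescript
// Hypothetical summary of an open PR: its number and the paths it touches.
interface OpenPr {
  number: number;
  files: string[];
}

// Returns the open PRs that touch at least one locally changed file.
function overlappingPrs(myFiles: string[], openPrs: OpenPr[]): OpenPr[] {
  const mine = new Set(myFiles);
  return openPrs.filter((pr) => pr.files.some((f) => mine.has(f)));
}
```

A non-empty result does not mean a conflict is certain; it is a prompt to read the other PR before continuing.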
**Failure mode prevented:** PRs that conflict catastrophically with parallel work; agents shipping fixes for already-fixed bugs; agents accidentally reverting peer work via `--theirs`/`--ours`. ### Rule 10 — Lessons land in memory the moment they happen When you discover a fact that future sessions will need, write a memory. Not at session end. Not "if I have time". Now. Triggers to write a memory: - The user corrected you on a non-obvious point. - You discovered a failure mode that took >15 minutes to debug. - You found a non-obvious convention by reading code. - You confirmed something that contradicts what a doc says. What does **not** trigger a memory: - Information already in `CLAUDE.md` or `AGENTS.md`. - Information derivable from `git log` or `gh issue view`. - One-time facts that will not recur (a specific bug fix unrelated to a pattern). The memory is a fact + how to apply it + why. Three lines minimum. If you cannot write three lines, the lesson is not yet learned. **Failure mode prevented:** the same lesson re-discovered every quarter; memory bloating with chat-scoped facts; new agents getting no benefit from prior sessions. ## See also - [`../../templates/CLAUDE.md.template.md`](../../templates/CLAUDE.md.template.md), [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md), [`../../templates/MEMORY.md.template.md`](../../templates/MEMORY.md.template.md). - [`../governance/README.md`](../governance/README.md) — PR-intent + merge rules that operationalize rules 5, 6, 9. - [`../../prompts/README.md`](../../prompts/README.md) — system prompts + sub-agent recipes for rule 8. ==== https://playbook.agentskit.io/docs/pillars/architecture --- title: 'Pillar — Architecture' description: 'How to keep a codebase **legible and modular under multi-agent development**.' --- # Pillar — Architecture How to keep a codebase **legible and modular under multi-agent development**. 
The single biggest predictor of agent quality is whether the codebase tells the agent where to put new code. If the package boundaries are vague, agents pile features into the nearest big file. If the boundaries are explicit and named, agents route correctly even without supervision. ## Documents in this pillar | Doc | Layer | Read when | |---|---|---| | [`universal.md`](./universal.md) | Stack-agnostic principles | Designing any codebase | | [`ts-concrete.md`](./ts-concrete.md) | TS / pnpm / Turbo recipes | Implementing in a TS monorepo | | [`adr-pattern.md`](./adr-pattern.md) | Universal + TS | Recording a decision | | [`rfc-pattern.md`](./rfc-pattern.md) | Universal + TS | Proposing a breaking change | | [`contracts-zod-pattern.md`](./contracts-zod-pattern.md) | TS-concrete | Designing JSON-RPC / HTTP / IPC boundaries | | [`error-hierarchy.md`](./error-hierarchy.md) | Universal + TS | Designing the error model | | [`file-size-budget.md`](./file-size-budget.md) | Universal + TS | Enforcing reviewability | | [`anti-overengineering.md`](./anti-overengineering.md) | Universal | Resisting agent default-to-abstract | | [`feature-flags-pattern.md`](./feature-flags-pattern.md) | Universal + TS | Decoupling deploy from release | | [`api-versioning-pattern.md`](./api-versioning-pattern.md) | Universal | Breaking-change deprecation lifecycle | | [`distributed-data-pattern.md`](./distributed-data-pattern.md) | Universal | Replicas, sharding, CAP, eventual consistency | | [`multi-region-pattern.md`](./multi-region-pattern.md) | Universal | Geo failover, sovereignty, RPO/RTO | | [`event-streaming-pattern.md`](./event-streaming-pattern.md) | Universal | Queues, pub/sub, streams; idempotency; DLQ; schema evolution | | [`caching-cdn-pattern.md`](./caching-cdn-pattern.md) | Universal | 3 cache tiers; TTL discipline; invalidation; key scoping | | [`api-gateway-pattern.md`](./api-gateway-pattern.md) | Universal | Edge ingress; what belongs vs not; BFF; GraphQL federation | | 
[`service-mesh-pattern.md`](./service-mesh-pattern.md) | Universal | Sidecar mTLS; retries; observability; when to adopt vs not | | [`platform-engineering-idp-pattern.md`](./platform-engineering-idp-pattern.md) | Universal | Internal Developer Platform; golden paths; DORA metrics | | [`iac-pattern.md`](./iac-pattern.md) | Universal | Infrastructure as code; modules; state; drift; cost forecast | | [`offline-first-sync-pattern.md`](./offline-first-sync-pattern.md) | Universal | Local persistence; sync protocols; conflict resolution; CRDT | ## The core idea A codebase has three architectural surfaces, and each one needs a different kind of documentation: 1. **Boundaries** — what depends on what. Documented as a package routing table. 2. **Decisions** — why the boundaries are where they are. Documented as ADRs. 3. **Contracts** — what crosses boundaries. Documented as Zod (or equivalent) schemas, with stable error codes and versioning. If any of the three is implicit, agents will reinvent it differently each session. ## Anti-patterns this pillar prevents - Agents reimplementing upstream primitives because the routing table didn't say where they live. - Agents proposing breaking changes in a PR description instead of an RFC, so the change is invisible to future agents. - Agents throwing raw `new Error('...')` at a JSON-RPC boundary, making the error opaque in the client. - Files growing to 1500 lines because no budget said "extract". - Two agents creating sibling packages that re-export the same primitive under different names. ## How to adopt 1. Read [`universal.md`](./universal.md). Internalize the five non-negotiables. 2. Write your project's `AGENTS.md` routing table (template in [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md)). 3. Stand up ADR + RFC directories (templates in [`../../templates/`](../../templates/)). Number the first ADR "Philosophy". 4. 
Wire structural gates from [`../../scripts/`](../../scripts/) — the file-size, named-export, and no-`any` gates pay back in week one. 5. Add an ADR every time an agent proposes a structural change. Reject the change if there is no ADR. ==== https://playbook.agentskit.io/docs/pillars/architecture/adr-pattern --- title: 'ADR Pattern' description: 'How to record architecture decisions so future agents (and humans) can find them, trust them, and supersede them cleanly.' --- # ADR Pattern How to record architecture decisions so future agents (and humans) can find them, trust them, and supersede them cleanly. ## TL;DR (human) An **Architecture Decision Record** is a short, append-only document — usually 50–200 lines — that captures one decision: what changed, why, what was rejected, what becomes harder. Number them monotonically. Never delete; tombstone when superseded. The accepted ADR is the source of truth — the code implements it. ## For agents ### When to write an ADR Write one when **any** of the following is true: - The change crosses a package boundary or introduces a new package. - The change introduces a new top-level concept (a new error namespace, a new lifecycle, a new persistence store). - The change is reversible only at significant cost (>1 day to undo cleanly). - A reviewer asks "why this and not the alternative?". - The same question has come up twice. If none of the above hold, do not write an ADR. ADRs are not a substitute for code comments. ### Numbering + filenames - Zero-padded sequence: `0001-philosophy.md`, `0014-contract-registry.md`, `0042-file-size-budget.md`. - Slug is kebab-case, descriptive in three words or fewer. - Numbers are never reused. Tombstoned ADRs keep their number. ### Sections Use [`../../templates/ADR.template.md`](../../templates/ADR.template.md). Required sections: 1. **Status** — Proposed / Accepted / Superseded by ADR-NNNN / Tombstoned. 2. **Context** — what is true today; what triggered the decision. 3. 
**Decision** — what we will do, in imperative voice ("we will...").
4. **Consequences** — what becomes easier, what becomes harder, what is now forbidden.
5. **Alternatives considered** — at least one. "We did not consider alternatives" is a yellow flag.
6. **Rollout** — how the codebase reaches the new state (codemod, gate, manual sweep, etc.).

Optional sections: open questions, related ADRs, related issues / PRs.

### Lifecycle

```
Proposed → (review window) → Accepted
                           ↘ Rejected → (kept on disk for record)

Accepted → Superseded by ADR-NNNN  (when a later ADR replaces it)
         → Tombstoned              (when the surface it describes is removed)
```

Rules:

- **Accepted** is the source of truth. If the code disagrees with an accepted ADR, the code is wrong (or the ADR needs a superseder).
- **Never edit an accepted ADR's Decision section.** Replace via a new ADR that supersedes it. The trail must be diff-friendly.
- **Tombstoned ADRs stay on disk.** Prepend a one-line tombstone notice; do not delete the body.

### Gate

Two automated checks pay off:

1. **Sequence integrity** — no gaps, no duplicates, no missing numbers.
2. **Status hygiene** — every ADR has a recognised Status value; "Superseded by ADR-NNNN" references a real file.

Reference impl: [`../../scripts/check-adr.example.mjs`](../../scripts/check-adr.example.mjs).

### Common failure modes (sourced from production)

- **"We'll add the ADR later".** Later never comes. The reviewer who needed the rationale moves on. Result: agents revert the change six months later because they cannot see why it was made. → Block PRs that change architecture without an ADR.
- **Editing the Decision of an accepted ADR.** Git history shows you reversed the rule, but the file reads as if the rule was always the new one. Future agents are confused. → Supersede via a new ADR.
- **One mega-ADR per quarter.** Covers ten decisions, three of which contradict. Cannot be referenced cleanly. → One decision per ADR.
- **No "Alternatives considered".** Agents propose the same alternative again next session, costing review cycles. → Always list at least one rejected alternative with the reason. ### See also - [`rfc-pattern.md`](./rfc-pattern.md) — when an ADR is not enough. - [`../governance/README.md`](../governance/README.md) — PR-intent manifests that reference ADRs. - [`../../templates/ADR.template.md`](../../templates/ADR.template.md) — copy-paste skeleton. ==== https://playbook.agentskit.io/docs/pillars/architecture/anti-overengineering --- title: 'Anti-Overengineering' description: 'How to keep agents (and engineers) from building three layers of abstraction for what would have been ten lines of code.' --- # Anti-Overengineering How to keep agents (and engineers) from building three layers of abstraction for what would have been ten lines of code. ## TL;DR (human) Agents over-abstract by default — interfaces around single implementations, generic registries for two callers, plugin systems for a known-fixed set. The discipline is **YAGNI** enforced by complexity budgets: cyclomatic complexity per function, dependency-depth per module, abstraction count per package. Code that "might be needed someday" almost never is; when it is, you add it then. ## For agents ### The default failure mode Faced with a task, an agent's default reach is to abstract. Three patterns recur: 1. **Interface for one implementation.** A class with a single constructor → wrapped in an interface "in case we add another". The interface adds nothing; readers click through it for nothing. 2. **Generic for two callers.** Two call sites with subtle differences → a generic helper with a config object. The config object grows; the function becomes harder to read than the two originals. 3. **Plugin system for a known set.** Three known integrations → an extensible plugin loader. The loader is more code than the three integrations combined. None of these are wrong in principle. 
They are wrong when the abstraction does not yet pay for itself.

### YAGNI test

Before adding any of:

- Interface / abstract base class.
- Generic helper.
- Plugin / registry / strategy pattern.
- Configuration option.
- Indirection (manager / coordinator / orchestrator).

Ask:

1. **Do I have two real, current call sites?** Not "I can imagine two". Two concrete sites in code today.
2. **Do the two sites diverge in ways the abstraction would unify?** If they happen to share shape but their concerns are different, abstracting couples them.
3. **Is the abstraction smaller than the duplication it removes?** If the abstraction is bigger, it's overhead.

If any answer is no: do not abstract yet. Inline the second call. When a third call appears, revisit.

### Complexity budgets

Hard caps, enforced by a gate:

| Metric | Budget |
|---|---|
| Cyclomatic complexity per function | ≤ 14 |
| Function length | ≤ 60 lines |
| Parameter count per function | ≤ 5 |
| Class member count | ≤ 15 (more often signals the class is doing too much) |
| Module export count | ≤ 20 (more = split the module) |
| Dependency depth (a calls b calls c calls d...) | ≤ 6 |
| `extends` chains | ≤ 2 |

When a budget breaks, the answer is **simplification, not a bigger budget**.
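A gate over these budgets can be as small as a comparison table. The sketch below covers three of the metrics from the table; `FunctionMetrics` and `budgetViolations` are hypothetical names, and the measurements are assumed to come from some upstream analysis step that is not shown.

```typescript
// Budgets copied from the complexity-budget table (three of the seven rows).
const BUDGETS = {
  cyclomaticComplexity: 14,
  functionLength: 60,
  parameterCount: 5,
} as const;

type Metric = keyof typeof BUDGETS;

// Hypothetical per-function measurements produced by an analysis step.
interface FunctionMetrics {
  name: string;
  cyclomaticComplexity: number;
  functionLength: number;
  parameterCount: number;
}

// Returns one message per metric that exceeds its budget; empty means green.
function budgetViolations(fn: FunctionMetrics): string[] {
  const metrics = Object.keys(BUDGETS) as Metric[];
  return metrics
    .filter((m) => fn[m] > BUDGETS[m])
    .map((m) => `${fn.name}: ${m} is ${fn[m]}, budget is ${BUDGETS[m]}`);
}
```

Keeping the budgets in one constant means that raising a limit shows up as a one-line diff a reviewer can challenge.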
### Patterns that recur as over-engineering

| Pattern | Symptom | When it's earned | When it's overhead |
|---|---|---|---|
| Repository pattern | `UserRepository` wrapping ORM | Multiple persistence backends | One DB, one ORM, no churn — use the ORM directly |
| Service layer | `UserService` calling `UserRepository` | Cross-store transactions | CRUD-only, no business logic |
| DTO mapping | `UserDTO ↔ UserModel ↔ UserEntity` | API and DB shapes diverge | They have the same fields |
| Factory | `UserFactory.create()` | Construction logic is non-trivial | Just `new User(args)` |
| Generic event bus | for 3 events | When event shape varies + decoupling needed | Direct call is clearer |
| Config object | 8-field `{...opts}` | Many call sites with diverse needs | One caller; positional args fine |
| Custom hook for one call | `useCount()` wraps `useState` | Reused logic ≥ 3 sites | Inline `useState` |
| Wrapper component for one prop | `` is a `\` with `class` | Variants justify it | Direct className |
| Indirection through manager | `WidgetManager.create(WidgetSpec)` for two widgets | Plugin ecosystem | Static dispatch |

### "Premature optimization is the root of all evil" — and its corollary

The famous line. The corollary, less quoted: **premature flexibility is more expensive than premature optimization**, because optimization can be torn out, but flexibility breeds usage that locks the shape in.

Apply the same skepticism to "extension points" as to "fast paths". Add when needed. Inline when not.

### Signs the codebase is over-engineered

- Reading a function requires following 4+ indirections to find what it actually does.
- A bug fix requires editing files in 3+ layers.
- New developer onboarding takes 3+ days to "understand the architecture".
- "Where do I add X?" has multiple plausible answers.
- The configuration documentation is longer than the implementation.
- Pull requests are mostly plumbing changes.
### Refactor in the simplify direction When you find over-engineered code: 1. Inline a layer at a time. Measure: does each inline make the call site harder or easier to read? 2. Stop inlining when each step starts to hurt readability. 3. The final shape is often shallower than the original by 2–3 layers. This is the opposite of the usual refactor direction. Both are valid moves; agents tend to only know one. ### Gate (optional) A complexity-budget gate (similar to file-size) tracks: - Cyclomatic complexity per function (use `complexity` ESLint rule). - Function length. - Module export count. Shrink-only baseline. New functions / modules respect the budgets; existing offenders cannot grow. Reference impl shape: parse with `@typescript-eslint/parser`, walk function nodes, measure, compare to baseline. ### Common failure modes - **Building "platform" before product.** Six months on a plugin loader; no plugins yet. → Ship product; carve out platform when 3+ plugins force the shape. - **Adding a config option per request.** Three options become twenty; nothing uses combination N. → Defaults are the contract; options are for real, recurring needs. - **Abstraction with one implementation.** The interface exists "for future flexibility". → Delete the interface; use the class directly. - **Future-proofing for problems that never arrive.** "We'll need to support 10 databases someday." → Support the one you use now; design the boundary so adding the second is one PR. - **Architecture astronaut review comments.** "What if we wanted to..." → Reviewer should suggest the simpler option, not the more elaborate one. ### When to actually abstract These are real signals: - Three or more current callers with shared concerns. - The thing being abstracted is changing for reasons the callers should not care about. - The duplication is across files that change together for the wrong reason (the change scattered). - A test forces an abstraction (the test can only run if a seam exists). 
When at least two of these hold, abstract. Otherwise, inline + wait. ### Anti-overengineering as a culture This pillar fights against agent default reach. Reinforce by: - Reviewer prompt explicitly asks "could this be simpler?". - The `system-implementer` prompt forbids speculative abstraction. - Code review praises inlined, direct code. - Reading group on classics (John Carmack's inlined-code essay, Casey Muratori on hierarchies). ### See also - [`file-size-budget.md`](./file-size-budget.md) — size budget complements complexity budget. - [`universal.md`](./universal.md) — Rule 1 names every boundary; named boundaries do not need wrappers. - [`../quality/test-pyramid.md`](../quality/test-pyramid.md) — over-abstracted code is harder to test. ==== https://playbook.agentskit.io/docs/pillars/architecture/api-gateway-pattern --- title: 'API Gateway Pattern' description: 'The edge between clients and your services — what belongs there, what doesn''t, and how to keep it from becoming a monolith in disguise.' --- # API Gateway Pattern The edge between clients and your services — what belongs there, what doesn't, and how to keep it from becoming a monolith in disguise. ## TL;DR (human) An API gateway is the single ingress point: TLS termination, auth verification, rate limiting, routing, observability. It should be **thin** — cross-cutting concerns only; never business logic. Fat gateways become bottlenecks and deploy risks. Common patterns: BFF (Backend-for-Frontend), GraphQL federation, reverse-proxy router. 
## For agents

### What belongs in the gateway

| Concern | At gateway |
|---|---|
| TLS termination | ✓ |
| Request routing (host / path) | ✓ |
| Auth token verification (cheap path) | ✓ (deeper auth = service-level) |
| Rate limiting (per IP / per identity) | ✓ |
| Request / response logging | ✓ (sampled) |
| Trace propagation (request-id) | ✓ |
| Compression (gzip / brotli) | ✓ |
| Static asset serving | ✓ (or CDN) |
| Header normalisation | ✓ |
| CORS | ✓ |
| Bot detection | ✓ |
| Geo restrictions | ✓ |
| Request transformation (rare) | Maybe |
| Response caching (rare) | Maybe |
| Business logic | ✗ |
| Database queries | ✗ |
| Cross-service orchestration | ✗ |

### What does NOT belong

- **Business validation**: each service validates its own inputs (per [`contracts-zod-pattern.md`](./contracts-zod-pattern.md)).
- **Service-specific transforms**: that's the service's job; gateway should be generic.
- **Cross-service orchestration**: separate orchestration service; not gateway.
- **Data fetching**: gateway passes through; services own data.

Symptoms of a fat gateway:

- Gateway codebase larger than any backend service.
- Gateway requires expert team to modify.
- Gateway deploys gate other deploys.
- Gateway is a single point of failure with no quick replacement path.

If your gateway has these symptoms, it's eaten responsibilities. Trim.

### Architectures

**Reverse-proxy router** (thinnest): pure routing + cross-cutting:

```
client → ALB / NGINX / Envoy → service A
                             → service B
                             → service C
```

Services own their own auth, validation, business logic. Gateway adds little but routing and cross-cutting.

**BFF (Backend-for-Frontend)**: one gateway-ish service per client type:

```
web client → BFF-web → (services)
mobile     → BFF-mob → (services)
admin      → BFF-adm → (services)
```

BFF tailors API shape per client; reduces per-client over-fetch. Avoids "the API tries to please everyone".

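A minimal sketch of the BFF idea, assuming a hypothetical `UserRecord` returned by a backend user service. Each BFF reshapes the same record for its own client. All type and function names here are illustrative and are not part of the playbook.

```ts
// Hypothetical record as returned by a backend user service (illustrative only).
interface UserRecord {
  id: string;
  email: string;
  displayName: string;
  avatarUrl: string;
  lastLoginAt: string;
}

// Assumed web shape: richer views want most fields.
interface WebUser {
  id: string;
  email: string;
  displayName: string;
  avatarUrl: string;
  lastLoginAt: string;
}

// Assumed mobile shape: only what the list screen renders, to avoid over-fetch.
interface MobileUser {
  id: string;
  displayName: string;
  avatarUrl: string;
}

// The web BFF passes the record through field by field.
function toWebUser(user: UserRecord): WebUser {
  return {
    id: user.id,
    email: user.email,
    displayName: user.displayName,
    avatarUrl: user.avatarUrl,
    lastLoginAt: user.lastLoginAt,
  };
}

// The mobile BFF drops fields the mobile client never reads.
function toMobileUser(user: UserRecord): MobileUser {
  return {
    id: user.id,
    displayName: user.displayName,
    avatarUrl: user.avatarUrl,
  };
}
```

The shaping is client-specific presentation only. Validation and business rules stay in the services, in line with the "thin gateway" rule above.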
**GraphQL gateway / federation**: GraphQL endpoint composes subgraphs:

```
client → GraphQL gateway → user-service  (User subgraph)
                         → flow-service  (Flow subgraph)
                         → audit-service (Audit subgraph)
```

Federation (Apollo Federation, Hot Chocolate Federation): subgraphs declare their types; gateway stitches.

**API mesh** (less common): pure proxy with declarative composition rules; no code.

### Choosing

| Pattern | Use when |
|---|---|
| Reverse-proxy | < 10 services; simple shape |
| BFF | Multiple client types with distinct API needs |
| GraphQL | Rich data graph; many clients; query flexibility valued |
| Mesh | Lots of services + standard composition; less common |

### Authentication at the gateway

The gateway verifies tokens (cheap path): signature check, expiry, basic shape.

Deeper auth (per-resource access, capability checks) happens at the service layer (per [`../security/rbac-pattern.md`](../security/rbac-pattern.md)). Gateway sends the verified principal-id; service makes finer decisions.

Why split:

- Gateway can stay generic + fast.
- Services know their own resources + permissions; centralised would couple too tightly.

### Request-id + tracing

Gateway:

- Generates `requestId` (UUIDv7 or ULID) if missing.
- Adds `X-Request-Id` header on outgoing request to services.
- Creates the root trace span.
- Logs the request with id.

Services propagate. Observability correlates (per [`../quality/observability-pattern.md`](../quality/observability-pattern.md)).

### Versioning at the gateway

If versioned URLs (`/v1/...`, `/v2/...`):

- Gateway routes by version.
- Both versions live during deprecation.
- Gateway can run a transformation if v1 → v2 is mechanical (rare; usually the service version owns it).

### Caching at the gateway

For cacheable responses:

- Honor `Cache-Control` from services.
- Tag-based purge.
- Per-user cache requires care (key must include user id).

Most caching is best at CDN (closer to users); gateway-level caching is for shared backend responses. ### Cost concerns Gateway is on every request — cost discipline: - Latency budget: < 10ms steady-state at the gateway. - Memory + CPU profile under load. - Auto-scale per traffic. - Per-environment sized (staging smaller than prod). A slow gateway impacts every endpoint. Profile + tune. ### Failure mode: gateway as bottleneck When the gateway can't be skipped, it's a single point of failure. Mitigations: - **Multi-region**: per-region gateway (per [`multi-region-pattern.md`](./multi-region-pattern.md)). - **Multi-AZ within region**. - **Health checks + auto-replacement**. - **Direct service access for internal callers**: services call each other directly when feasible, bypassing gateway. ### Service-to-service Inside the cluster, services often call each other directly (mesh) rather than through the gateway: - Gateway is for external traffic. - Internal: service-mesh (per [`service-mesh-pattern.md`](./service-mesh-pattern.md)) handles cross-cutting (mTLS, observability, retries). Don't route internal traffic through the external gateway. Adds latency; couples internal architecture to external entry. ### Deployment risk Gateway deploys block every service. Mitigations: - Canary deploys. - Blue-green for the gateway specifically. - Roll-forward only (per [`../quality/ci-cd-pipeline-pattern.md`](../quality/ci-cd-pipeline-pattern.md)). - Practice gateway rollback (drill). ### Common failure modes - **Fat gateway**: business logic crept in. → Refactor; move logic to services. - **Gateway as a deploy gatekeeper**: services can't deploy without gateway change. → Stable contract; service-side changes don't require gateway changes. - **Gateway as cache that lies**: stale data; users confused. → Conservative caching; service-driven invalidation. - **No request-id propagation**: cannot trace requests. → Mandatory. 
- **CORS handled in each service inconsistently**: → Centralise at gateway. - **TLS termination at gateway only**: internal traffic plaintext. → mTLS internal (service mesh). ### Tooling stack (typical) | Concern | Tool | |---|---| | Cloud-native | AWS API Gateway, GCP API Gateway, Azure APIM | | Self-hosted | Kong, Tyk, KrakenD, Apache APISIX | | Reverse-proxy | NGINX, Caddy, Envoy, Traefik, HAProxy | | GraphQL federation | Apollo Router, Cosmo, Mercurius | | BFF framework | Whatever your stack uses (Next.js, Nest.js, Rails, etc.) | ### Adoption path 1. **Few services**: ALB / Load balancer is enough; no "gateway" per se. 2. **~10 services**: reverse-proxy gateway with cross-cutting concerns. 3. **Multiple client types**: BFFs. 4. **Rich data graph**: GraphQL federation. 5. **Mature mesh**: gateway + service mesh; internal traffic doesn't traverse gateway. ### See also - [`../security/rate-limiting-ddos-pattern.md`](../security/rate-limiting-ddos-pattern.md) — gateway-side rate limiting. - [`../security/session-mgmt-pattern.md`](../security/session-mgmt-pattern.md) — token verification at gateway. - [`service-mesh-pattern.md`](./service-mesh-pattern.md) — internal traffic. - [`caching-cdn-pattern.md`](./caching-cdn-pattern.md) — CDN sits in front of gateway. - [`anti-overengineering.md`](./anti-overengineering.md) — premature gateway = the canonical trap. ==== https://playbook.agentskit.io/docs/pillars/architecture/api-versioning-pattern --- title: 'API Versioning + Deprecation Pattern' description: 'How to evolve public contracts without surprising consumers.' --- # API Versioning + Deprecation Pattern How to evolve public contracts without surprising consumers. ## TL;DR (human) Public APIs follow semver per package. Non-breaking changes (add field, add method) are minor. Breaking changes (rename, remove, type change) are major and require an RFC, a deprecation window, and a migration guide. 
The wire stays stable longer than the code; clients trust this or they leave. ## For agents ### Three change categories | Change | Semver | Process | |---|---|---| | Add a new method | minor | Free; new entry in registry | | Add an optional field to params | minor | Default value handles old callers | | Add a field to result | minor | Old clients ignore unknown fields | | Tighten validation (was permissive) | major | Existing payloads may newly reject | | Loosen validation | minor (usually) | Old clients still parse | | Remove a method | major | Deprecation cycle | | Rename a field | major | Deprecation cycle (keep both, then drop old) | | Change a field's type | major | Deprecation cycle | | Change an error code | major | Code is the contract; clients pattern-match | | Change semantics of an existing method | major (always) | Even if shape unchanged | ### Deprecation lifecycle A breaking change to a public surface follows: 1. **RFC.** Proposed change + migration plan. Review window per [`rfc-pattern.md`](./rfc-pattern.md). 2. **Implement the new shape alongside the old.** Both work. Both have tests. 3. **Mark the old as deprecated.** `@deprecated` JSDoc; `Deprecation` HTTP header; explicit log warning per use. Document migration in the deprecation message. 4. **Deprecation window**: at least one major version, or the documented period (commonly 90 / 180 / 365 days). 5. **Telemetry**: instrument old-shape usage. If usage drops to zero earlier, accelerate retirement. 6. **Retire.** Major bump. Old shape removed in the same release. The window is the contract. Honor it even if internal usage is zero — external consumers may exist. ### `@deprecated` discipline A deprecation comment must include: - **What is deprecated** (the symbol / method / shape). - **When it will be removed** (target version or date). - **What to use instead** (the migration target, with a code example). - **Why** (the rationale; usually a link to the RFC). ```ts /** * @deprecated Since v2.3. 
Will be removed in v3.0. * Use `users.invite` instead, which carries explicit role assignment. * See RFC-0023 (https://...). * * @example * // before * await client.users.create({ email, defaultRole }); * // after * await client.users.invite({ email, role }); */ export function create(params: CreateParams) { ... } ``` A `@deprecated` with no migration path is debt, not deprecation. ### Migration guides Per major release, a migration guide doc lives at `docs/migrations/v\.md`. Sections: - **Summary** of breaking changes. - **Per-change**: before / after code snippets, automated codemod (if any), test surface to verify. - **Order of operations** for migrating a large consumer codebase. - **Rollback procedure** if migration fails. A migration guide that doesn't tell consumers the **order** to migrate ("update X first, then Y") is incomplete. ### Codemods for major bumps For migrations that are mechanical (rename a field, restructure a call), ship a codemod: ```bash npx @your-org/migrate --from v2 --to v3 --dry-run ``` Codemods transform source code on the consumer's repo. Standard tools: jscodeshift, ast-grep, ts-morph. Even when the codemod doesn't cover 100% of cases, it covers the bulk; humans / agents handle the tail. The 80/20 rule: a codemod that handles 80% of call sites is worth shipping. ### Backwards-compat shims When a breaking change is essentially "rename one field": ```ts // Receive both old and new; emit only new. const Params = z.object({ email: z.string().email(), // old → new shim defaultRole: z.string().optional(), // deprecated role: z.string().optional(), }).transform((d) => ({ email: d.email, role: d.role ?? d.defaultRole, })); ``` The shim: - Lives only during the deprecation window. - Logs a deprecation warning when the old field is used. - Has a removal date; removal PR is pre-scheduled. Shims that outlive their deprecation window are debt. ### Wire format vs internal types The wire format is the contract. 
Internal types can refactor freely as long as the wire is unchanged. When agents look at the schema package and think "this is convoluted", check: is it convoluted because of *wire compatibility*? If so, leave it. Refactor the internal layer (the handler, the store) without touching the wire. ### Version negotiation If you support multiple major versions concurrently: - **URL-based**: `/v1/...` / `/v2/...` paths. - **Header-based**: `Accept: application/vnd.api+json; version=2`. - **Per-request**: the client sends its version; the server adapts. URL-based is easiest to implement and discover. Header-based is cleaner conceptually but harder to debug. ### REST / RPC / GraphQL specifics | Style | Versioning convention | Notes | |---|---|---| | REST | URL path segment | Most discoverable | | JSON-RPC | Method name suffix or namespace | `users.list.v2` | | GraphQL | Schema evolution (no versions); field-level `@deprecated` | The GraphQL way: deprecate fields, never remove silently | | gRPC | `proto3` field reservation | Reserve removed field numbers to prevent reuse | Pick one. Mixing styles confuses consumers. ### Stable error codes — the same rules Error codes are part of the public contract. The same rules apply: - Append-only. - Rename = breaking → RFC. - New codes are minor. - Removed codes are major (clients pattern-match). See [`error-hierarchy.md`](./error-hierarchy.md) for the error model. ### Common failure modes - **Silent breaking change.** "Refactor" PR changes a wire field; clients break. → Gate detects schema diff; requires RFC reference. - **Deprecation without migration path.** `@deprecated` says "use the new method" without examples. → Migration code required in the comment. - **Removing on next release.** Same release deprecates AND removes. → Honor the window. - **Two versions live forever.** v1 + v2 + v3 all maintained; engineering velocity craters. → Sunset old majors on a calendar. - **Forgot to bump major.** Patch release breaks consumers. 
→ Schema-diff gate (see [`contracts-zod-pattern.md`](./contracts-zod-pattern.md)) catches it.

### See also

- [`rfc-pattern.md`](./rfc-pattern.md) — breaking changes require RFC.
- [`contracts-zod-pattern.md`](./contracts-zod-pattern.md) — schema gate detects breaking diffs.
- [`error-hierarchy.md`](./error-hierarchy.md) — error codes are contract.
- [`feature-flags-pattern.md`](./feature-flags-pattern.md) — flags ramp new behavior without breaking old.

==== https://playbook.agentskit.io/docs/pillars/architecture/caching-cdn-pattern
---
title: 'Caching + CDN Pattern'
description: 'How to layer caches so the system is fast, cheap, and consistent — three properties in tension.'
---

# Caching + CDN Pattern

How to layer caches so the system is fast, cheap, and consistent — three properties in tension.

## TL;DR (human)

Three cache tiers: in-process, distributed, CDN/edge. Each has different latency, hit-rate, and invalidation profile. Cache invalidation is the second-hardest problem; default to TTL backstops on everything. Cache keys must scope by tenant; cache busting requires a strategy designed before the first cache lands.

## For agents

### The three tiers

| Tier | Latency | Hit rate | Invalidation difficulty |
|---|---|---|---|
| **In-process** (memory, per-process) | sub-µs | High per worker; low across cluster | Easy (process restart) |
| **Distributed** (Redis, Memcached, Hazelcast) | sub-ms (in-region) | Cross-process; cluster-wide | Medium |
| **CDN / edge** (Cloudflare, Fastly, Akamai, CloudFront) | ms globally | Geographically distributed | Hard |

A typical product uses all three: in-process for tiny hot objects, distributed for shared reads, CDN for public assets + cacheable HTML.

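The first two tiers can be sketched as a read-through lookup. This is a simplified illustration under stated assumptions: `TtlStore` and `readThrough` are hypothetical names, plain `Map`s stand in for both tiers, and the calls are synchronous, whereas a real distributed cache client such as Redis would be asynchronous.

```ts
// Sketch only: an in-memory store with a mandatory TTL on every entry.
// The same class stands in for both the in-process and the distributed tier.
interface Entry<V> {
  value: V;
  expiresAt: number;
}

class TtlStore<V> {
  private readonly entries = new Map<string, Entry<V>>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns undefined on a miss or when the entry has expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // There is no way to store a value without a TTL.
  set(key: string, value: V, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
}

// Check the local tier, then the shared tier, then the origin.
// Assumes cached values are never `undefined`, which is used as the miss marker.
function readThrough<V>(
  key: string,
  local: TtlStore<V>,
  shared: TtlStore<V>,
  loadFromOrigin: () => V,
  ttl: { localMs: number; sharedMs: number },
): V {
  const fromLocal = local.get(key);
  if (fromLocal !== undefined) return fromLocal;

  const fromShared = shared.get(key);
  if (fromShared !== undefined) {
    local.set(key, fromShared, ttl.localMs); // repopulate the faster tier
    return fromShared;
  }

  const fresh = loadFromOrigin();
  shared.set(key, fresh, ttl.sharedMs);
  local.set(key, fresh, ttl.localMs);
  return fresh;
}
```

Giving the local tier a shorter TTL than the shared tier is one way to bound how long replicas can disagree with each other.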
### Cache invalidation strategies | Strategy | When to use | |---|---| | **TTL only** | Eventually-consistent data; simplest; safe default | | **Write-through** | Cache updated on every write | | **Write-around** | Writes skip cache; reads fetch + cache | | **Write-back** | Cache absorbs writes; flushed async (rare; risk of loss) | | **Event-based** | On write, publish invalidate event; consumers evict | | **Versioned keys** | Cache key includes entity version; new version = new key | | **Stale-while-revalidate** | Serve stale; refresh in background | Versioned keys are surprisingly powerful — no invalidation needed; updates just produce new keys. ### TTL discipline Every cache entry has a TTL. No exceptions. Even with event-based invalidation, TTL is the backstop — if the event is lost, the cache self-heals within the window. Per data class: - Hot public data (no user-specific): seconds to minutes. - Per-tenant cacheable: minutes. - Per-user session: minutes to hours. - Public static assets: long (days to year) + cache-busting via versioned URL. ### Cache key discipline Multi-tenant: every cache key includes tenant id: ``` cache:workspaces::user: cache:flows::list:v3:limit=20:cursor=abc ``` A key without tenant scope is a data-leak waiting to happen. Lint scans for cache calls in code without a tenant identifier. Key naming: - `cache:\:\:\:\`. - Include version in key (`v3`) to roll-forward without flushes. - Hash long composite keys; keep short prefix for debugging. ### Cache stampede protection When a hot key expires, N concurrent requests hit the origin. Mitigations: - **Probabilistic early refresh**: a small percentage of requests refresh proactively before expiry. - **Lock + single-flight**: only one request fetches; others wait. - **Stale-while-revalidate**: serve stale; refresh background. Default: stale-while-revalidate. Simple; effective. ### CDN tier CDN handles: - Public static assets (JS, CSS, fonts, images). Long TTL + content-hashed URL. 
- Public HTML / API responses (per `Cache-Control` headers).
- Geographic distribution: edge POPs close to users.
- DDoS absorption (large CDNs have multi-Tbps capacity).

Configuration:

- `Cache-Control: public, max-age=31536000, immutable` for content-hashed assets.
- `Cache-Control: private` for user-specific responses.
- `Cache-Control: no-store` for sensitive responses (auth tokens, PII).
- `Vary` header for content negotiation.

Edge invalidation:

- Purge API per CDN; typically slow (minutes propagation).
- URL purge: invalidate one path.
- Tag purge: invalidate all URLs tagged X (when CDN supports).
- Default: design URLs to NOT need invalidation (content-hashed).

### Browser cache

Often forgotten:

- Service worker (PWA) is a cache too.
- HTTP cache (browser).
- Memory cache (same session).

Headers control. `immutable` + content-hashed URL eliminates revalidation entirely.

### When NOT to cache

- Per-request authorization-dependent data (cache key must include the auth context exactly).
- Tiny operations (cache lookup itself costs ~µs; uncached is cheaper).
- Data that mutates per request (counters, rate-limit windows).
- Sensitive PII or secrets.

### Cache hit-rate as a metric

Per cache:

- Hit rate (target: > 60% steady; varies).
- Miss penalty (origin latency).
- Eviction rate (high = too small or churn).

Track. Alert if hit rate drops; signals key churn, app pattern change, or capacity issue.

### Common failure modes

- **Cache key without tenant**: cross-tenant data leak.
- **No TTL**: stale forever.
- **Event-based invalidation, no TTL backstop**: missed event = forever stale.
- **Cache stampede**: hot key expires; thundering herd hits origin.
- **Caching authorized content publicly**: served to wrong users.
- **Aggressive CDN cache on dynamic page**: users see other users' state.
- **In-process cache in multi-replica deploy**: stale per replica; user gets different state each request.

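The stampede failure mode pairs with the lock + single-flight mitigation described earlier. A minimal sketch, assuming a hypothetical `SingleFlight` helper (not a playbook API): it deduplicates concurrent loads within one process only, so coordination across replicas would still need a distributed lock or stale-while-revalidate.

```ts
// Sketch only: concurrent callers asking for the same key share one in-flight load.
class SingleFlight<T> {
  private readonly inFlight = new Map<string, Promise<T>>();

  run(key: string, load: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // someone is already fetching this key

    const promise = load();
    this.inFlight.set(key, promise);

    // Forget the entry once the load settles, whether it succeeded or failed,
    // so the next miss triggers a fresh load instead of replaying an old result.
    const forget = (): void => {
      this.inFlight.delete(key);
    };
    promise.then(forget, forget);

    return promise;
  }
}
```

Callers would wrap the origin fetch, for example `flights.run(cacheKey, () => fetchFromOrigin(cacheKey))`, where `flights` and `fetchFromOrigin` are placeholders for your own instance and loader.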
### See also - [`distributed-data-pattern.md`](./distributed-data-pattern.md) — read-replica + cache interplay. - [`../security/multi-tenant-isolation-pattern.md`](../security/multi-tenant-isolation-pattern.md) — cache scope. - [`../quality/cost-optimization-pattern.md`](../quality/cost-optimization-pattern.md) — cache hit rate as cost metric. - [`../quality/performance-budgets-pattern.md`](../quality/performance-budgets-pattern.md) — cache pays the budget. ==== https://playbook.agentskit.io/docs/pillars/architecture/contracts-zod-pattern --- title: 'Contracts — Zod Method Registry Pattern' description: 'TS-concrete recipe for a typed JSON-RPC / HTTP / IPC boundary. Scales to several hundred methods across dozens of namespaces in a real production codebase.' --- # Contracts — Zod Method Registry Pattern TS-concrete recipe for a typed JSON-RPC / HTTP / IPC boundary. Scales to several hundred methods across dozens of namespaces in a real production codebase. ## TL;DR (human) Every method that crosses a trust boundary has: - a name (namespaced, dot-separated), - a Zod schema for its params, - a Zod schema for its result, - explicit flags for `requireAuth` and `requireConsent`, - a registered entry in one method registry. A dispatcher iterates the registry. Inbound payloads are parsed; failures become typed errors; outbound results are also parsed (catches handler bugs at the boundary, not in the wire). 
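The examples in this document import a `defineMethod` helper without showing it. The following is one possible shape, not the playbook's actual implementation: `Schema<T>` is a structural stand-in for a Zod schema so the sketch has no dependencies, the name check follows the lowercase dot-separated convention, and `requireAuth` defaults to `true` so that opting out has to be explicit.

```ts
// Structural stand-in for a Zod schema: anything whose parse() throws on bad input.
interface Schema<T> {
  parse(input: unknown): T;
}

// What the registry stores for each method (assumed shape).
interface MethodDef<M extends string, P, R> {
  readonly method: M;
  readonly params: Schema<P>;
  readonly result: Schema<R>;
  readonly requireAuth: boolean;
  readonly requireConsent: boolean;
}

// What a method file passes in; both flags are optional here.
interface MethodInput<M extends string, P, R> {
  method: M;
  params: Schema<P>;
  result: Schema<R>;
  requireAuth?: boolean;
  requireConsent?: boolean;
}

// Assumption: names are lowercase, dot-separated, with at least two segments.
const METHOD_NAME = /^[a-z][a-z0-9]*(\.[a-z][a-z0-9]*)+$/;

function defineMethod<M extends string, P, R>(
  input: MethodInput<M, P, R>,
): MethodDef<M, P, R> {
  if (!METHOD_NAME.test(input.method)) {
    throw new Error(`Invalid method name: ${input.method}`);
  }
  return Object.freeze({
    method: input.method,
    params: input.params,
    result: input.result,
    requireAuth: input.requireAuth ?? true, // secure by default
    requireConsent: input.requireConsent ?? false,
  });
}
```

Keeping the method name as a literal type (`M`) is what would let a registry built with `as const` derive a union of method names from its keys.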
## For agents

### Method definition

```ts
// packages/contracts/src/methods/users.ts
import { z } from "zod";
import { defineMethod } from "../define-method";

export const UsersListParams = z.object({
  workspaceId: z.string().uuid(),
  limit: z.number().int().positive().max(200).default(50),
  cursor: z.string().optional(),
});

export const UsersListResult = z.object({
  rows: z.array(z.object({
    id: z.string().uuid(),
    email: z.string().email(),
    role: z.enum(["owner", "admin", "member"]),
  })),
  nextCursor: z.string().nullable(),
});

export const usersList = defineMethod({
  method: "users.list",
  params: UsersListParams,
  result: UsersListResult,
  requireAuth: true,
  requireConsent: false,
});
```

### Registry

```ts
// packages/contracts/src/registry.ts
import { usersList, usersUpsert } from "./methods/users";
// ... import all method definitions

export const REGISTRY = {
  [usersList.method]: usersList,
  [usersUpsert.method]: usersUpsert,
  // ...
} as const;

export type MethodName = keyof typeof REGISTRY;
```

### Dispatcher

```ts
// packages/contracts/src/dispatcher.ts
import { ZodError } from "zod";
import { AppError } from "@app/core/errors";
import { REGISTRY } from "./registry";

// CallContext is the per-request context (principalId, consents, requestId),
// defined elsewhere in the codebase.
export type Handler<P, R> = (params: P, ctx: CallContext) => Promise<R>;

export async function dispatch(
  method: string,
  rawParams: unknown,
  ctx: CallContext,
  handlers: Record<string, Handler<unknown, unknown>>,
) {
  const entry = REGISTRY[method as keyof typeof REGISTRY];
  if (!entry) throw new AppError("METHOD_NOT_FOUND", `Unknown method: ${method}`);

  if (entry.requireAuth && !ctx.principalId) {
    throw new AppError("AUTH_REQUIRED", "Authentication required");
  }
  if (entry.requireConsent && !ctx.consents.has(method)) {
    throw new AppError("CONSENT_REQUIRED", "User consent required", {
      hint: `Call consent.grant({ scope: "${method}" })`,
    });
  }

  // Parse inbound params; validation failures become typed errors.
  let params;
  try {
    params = entry.params.parse(rawParams);
  } catch (err) {
    if (err instanceof ZodError) {
      throw new AppError("VALIDATION_ERROR", "Invalid params", { cause: err });
    }
    throw err;
  }

  const handler = handlers[method];
  if (!handler) throw new AppError("HANDLER_NOT_BOUND", `No handler for ${method}`);

  let result;
  try {
    result = await handler(params, ctx);
  } catch (err) {
    if (err instanceof AppError) throw err;
    // Unknown errors become opaque — never leak handler internals.
    throw new AppError("HANDLER_THREW", "Handler failed", { cause: err });
  }

  // Verify the handler returned a result that matches the contract.
  return entry.result.parse(result);
}
```

### Wire serialization

Errors over the wire:

```ts
{
  jsonrpc: "2.0",
  id,
  error: {
    code: -32000, // or a stable numeric mapping
    message: appError.message,
    data: {
      code: appError.code, // "VALIDATION_ERROR", "AUTH_REQUIRED", ...
      hint: appError.opts.hint,
      docsUrl: appError.opts.docsUrl,
      requestId: ctx.requestId, // ALWAYS log + return the requestId
    },
  },
}
```

Never include the `cause` chain in the wire payload — it can leak stack traces, file paths, secrets. Log the cause server-side with the `requestId` so support can correlate.

### Namespace conventions

- All-lowercase, dot-separated. `users.list`, `flows.upsert`, `cost.budgets.list`.
- The leading segment is the **owning surface** (a feature concept). The package that owns the namespace owns the handler.
- A namespace is owned by one package. If two packages want to handle `users.*`, the boundary is wrong.

### Stable contract changes (without breaking)

Add fields with defaults — non-breaking:

```ts
export const UsersListParams = z.object({
  workspaceId: z.string().uuid(),
  limit: z.number().int().positive().max(200).default(50),
  cursor: z.string().optional(),
  includeDisabled: z.boolean().default(false), // new field, default makes it non-breaking
});
```

Rename / remove fields — breaking → requires an RFC (see [`rfc-pattern.md`](./rfc-pattern.md)).

Method-level renames — also breaking → RFC. Keep the old name registered as a deprecated alias for one major version.

### Gate

Recommended automated checks:

1. **Registry completeness** — every file under `methods/` exports at least one `defineMethod` call, and every export is in the registry.
2. **No duplicate method names** — fail at build.
3. **Schema-change detector** — diff the compiled `.d.ts` of the contract package vs the previous release; flag any method whose params/result signature changed without an RFC reference.
4. **Handler binding completeness** — every method in the registry has a handler bound in the runtime.

Reference impls in [`../../scripts/`](../../scripts/).

### Common failure modes (sourced from production)

- **Handler returns the right shape minus one field.** Without `entry.result.parse(result)`, the client sees `undefined` and fails downstream. → Parse outbound; pay the small CPU cost.
- **Agent invents `users.fetchAll` because they did not search for `users.list`.** Two methods now exist; consumers split. → Maintain a namespace map (one file per namespace) and require that all methods for that namespace live in that file.
- **`requireAuth: false` slipped onto a sensitive method.** Silent vulnerability. → Default `requireAuth` to `true` in `defineMethod`; require an explicit `false` opt-out and review it in PR intent.
- **Stack traces in `error.data`.** Leaks file paths and sometimes secrets. → Log the cause; never return it.

### See also

- [`error-hierarchy.md`](./error-hierarchy.md) — error model the dispatcher uses.
- [`../security/README.md`](../security/README.md) — auth + consent semantics behind the flags.
- [`../quality/README.md`](../quality/README.md) — gates that enforce the registry shape.

==== https://playbook.agentskit.io/docs/pillars/architecture/distributed-data-pattern
---
title: 'Distributed Data Pattern'
description: 'How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.'
--- # Distributed Data Pattern How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency. ## TL;DR (human) Distributed data starts with **read replicas** (cheap, mostly transparent). Then **sharding** (expensive, design-defining). Then **multi-region** (operationally hard, recovery-defining). Each step trades consistency for availability and complexity. Pick the cheapest one that solves the actual problem; do not adopt the next tier speculatively. ## For agents ### The CAP triangle, briefly A distributed system under partition can guarantee at most two of: **C**onsistency, **A**vailability, **P**artition tolerance. Real systems are not on the corners — they pick a position on the edges. - **CP** (consistency over availability under partition): banking, audit ledgers. Reads/writes refuse if quorum unreachable. - **AP** (availability over consistency): social feed, analytics. Reads/writes succeed; data is eventually consistent. - **CA** (no partition tolerance — only viable in a single node). You will be **AP** for most user-facing data and **CP** for money + audit + identity. ### Step 1 — Read replicas Cheap, mostly transparent. One primary handles writes; N replicas serve reads. Rules: - **Writes go to primary.** Always. - **Reads with strict freshness go to primary.** Authentication, "did my write land", post-transaction reads. - **Reads that tolerate staleness go to replicas.** Listings, dashboards, analytics. - **The application layer chooses**: `db.replica.users.list(...)` vs `db.primary.users.list(...)`. Not the ORM's auto-magic. Auto-routing produces surprise replication-lag bugs. Replication lag: - Typically tens to hundreds of milliseconds in steady state. - Spikes to seconds under load. - Tail can reach minutes during failover. Design your queries to tolerate the worst-case lag, or send the affected query to primary. 
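The explicit-routing rule can be sketched as a small decision function that each call site invokes on purpose. `chooseTarget`, `Freshness`, and the lag fields are hypothetical names for illustration; how replica lag is observed is assumed to come from your own monitoring.

```ts
// Sketch only: the caller states its freshness need; nothing is auto-routed.
type Freshness = "strict" | "stale-ok";
type Target = "primary" | "replica";

interface RoutingInput {
  freshness: Freshness;
  observedLagMs: number; // current replica lag, as reported by monitoring (assumed)
  maxTolerableLagMs: number; // worst staleness this particular query can accept
}

function chooseTarget(input: RoutingInput): Target {
  // Auth checks, "did my write land", and post-transaction reads go to primary.
  if (input.freshness === "strict") return "primary";

  // Stale-tolerant reads use a replica unless lag exceeds what the query accepts.
  return input.observedLagMs <= input.maxTolerableLagMs ? "replica" : "primary";
}
```

A listing that tolerates one second of staleness would pass `maxTolerableLagMs: 1000` and fall back to primary during a lag spike or failover.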
### Step 2 — Sharding When data per primary exceeds what one node handles — typically when total data approaches a TB or QPS exceeds 10k+ — shard. Shard key choice is permanent (or at least very expensive to change). Get it right. **Good shard keys**: - `tenant_id` / `workspace_id` for multi-tenant systems (most queries are per-tenant). - `user_id` for user-facing systems. - Time-bucketed for append-heavy systems (event logs). **Bad shard keys**: - Auto-increment id (sequential = hot last shard). - `created_at` only (hot active shard). - Anything that produces "one big shard" (one popular tenant). Cross-shard queries are expensive. Design queries so 95%+ stay within a shard. **Resharding** is its own discipline: - Pre-split into more shards than you currently need (over-shard). - Use logical shards mapped to physical nodes; moving a logical shard is a node-add operation. - Tools: Vitess, Citus, application-level sharding with consistent hashing. ### Step 3 — Multi-region When users are distributed globally OR the failure of one region must not take the system down — multi-region. Three patterns: 1. **Active-passive**: one region writes; others stand by. Failover is operator-driven; RPO = replication lag, RTO = minutes. 2. **Active-active with leader per partition**: each partition (tenant, customer, geographic block) has a leader region. Writes to your data only succeed in your region. Cross-partition operations are rare and expensive. 3. **Fully active-active with CRDT / multi-leader**: writes succeed anywhere; conflicts resolved at the data layer. Expensive but powerful. Most products start at 1. Mature SaaS at 2. Few need 3. ### RPO / RTO Per system, document: - **RPO** (Recovery Point Objective): how much data can we lose in a disaster? "5 minutes" means replication is configured for ≤ 5-min lag. - **RTO** (Recovery Time Objective): how long to be back up? "30 minutes" means the failover procedure must complete within that. RPO and RTO are *promises*. 
The infrastructure must be able to deliver them; the runbook must be tested. ### Replication lag — visible in product When you have replicas, replication lag becomes a product concern: - **Write-then-read in the same request**: route the read to primary or use a session-pinned router. - **Write-then-read across requests**: use a "version cookie" — the write returns a version stamp; the next read carries it; the read either waits or routes to primary. - **List-after-create**: the new record may not appear in the listing for a few hundred ms. Either send the listing to primary or surface the new record optimistically in the UI. This is **eventual consistency in disguise**. Document it; expect it. ### Eventual consistency UX When eventual consistency is exposed to users: - Communicate optimistically: show the new state immediately in the UI, even if the read hasn't caught up. - Reconcile on next reload: if the optimistic state was wrong, show the truth, with an explanation. - Avoid surfaces where strict consistency is expected (financial balances, audit logs). ### Distributed transactions Avoid. They are slow, fragile, and cap throughput. When you genuinely need atomicity across two stores: - **Saga**: a sequence of local transactions + compensations. Each step can fail; compensate the prior steps. Common for orchestrated workflows. - **Outbox**: write changes to an outbox table in the same transaction as the business write; a separate process publishes the outbox events. - **Two-phase commit**: only when latency / availability constraints allow. Rare in modern systems. The discipline: prefer single-store atomic operations + sagas + outboxes over distributed transactions. ### Distributed ID generation Auto-increment IDs do not work across shards. 

| Approach | Pros | Cons |
|---|---|---|
| UUID v4 | Trivial; collision-free | Random insert order kills B-tree performance |
| UUID v7 / ULID | Sortable + collision-free | Standard support varies |
| Snowflake | Sortable + compact | Coordination layer; clock-sensitive |
| KSUID | Sortable; URL-safe | Larger than auto-increment |
| Pre-allocated ranges per shard | Sortable + fast | Coordination at allocation time |

Default: ULID. Sortable, collision-free, compact, library support broad.

### Caching tiers

When the database is the bottleneck, caching tiers absorb load:

1. **In-process cache**: per-process; ~µs latency; short TTL. For read-heavy, low-mutation data.
2. **Distributed cache** (Redis, Memcached): cross-process; sub-ms in-region; medium TTL. For shared-read data.
3. **CDN**: edge cache; ~ms latency globally; long TTL. For public content + static assets.

Cache invalidation is the second hard problem. Three strategies:

- **Time-based** (TTL): simple; stale-but-bounded.
- **Event-based**: on write, evict / update relevant cache entries. Complex; risk of bugs.
- **Versioned keys**: each entity has a version; key includes version; updates produce new keys.

Versioned keys are surprisingly powerful; consider them before event-based invalidation.

### Common failure modes

- **Adopting sharding before exhausting vertical scale.** A single big primary handles enormous load; sharding adds complexity for no benefit. → Measure first; vertical-scale first.
- **Shard key chosen by intuition, not data.** Hot shard; resharding misery. → Analyze actual query patterns.
- **Cross-shard queries everywhere.** Sharded but with the cost of unsharded. → Audit queries; >95% should be single-shard.
- **Replication lag ignored in code.** Write-then-read inconsistency surfaces as random bugs. → Explicit primary/replica routing.
- **Multi-region without RPO/RTO documented.** Failover happens; nobody knows what was lost. → Document; drill.
- **Cache invalidation via event-broadcast, no fallback.** Event missed → stale forever. → TTL as a backstop on every cache.

### See also

- [`anti-overengineering.md`](./anti-overengineering.md) — distributed data is the canonical over-engineering trap.
- [`multi-region-pattern.md`](./multi-region-pattern.md) — operational concerns at region scope.
- [`../security/multi-tenant-isolation-pattern.md`](../security/multi-tenant-isolation-pattern.md) — tenancy + sharding interplay.
- [`../quality/observability-pattern.md`](../quality/observability-pattern.md) — measure replication lag, cache hit rate, query distribution.

==== https://playbook.agentskit.io/docs/pillars/architecture/error-hierarchy
---
title: 'Error Hierarchy'
description: 'How to design an error model that survives multi-agent development and client-side pattern matching.'
---

# Error Hierarchy

How to design an error model that survives multi-agent development and client-side pattern matching.

## TL;DR (human)

One base class. One file of codes. Subclasses per namespace. Codes are append-only. Never throw raw `Error` at a boundary. The dispatcher is the only thing allowed to turn unknown thrown values into a generic opaque error.

## For agents

### Class shape

```ts
// packages/core/src/errors/app-error.ts
export type ErrorOpts = {
  readonly hint?: string;
  readonly docsUrl?: string;
  readonly cause?: unknown;
};

export class AppError extends Error {
  constructor(
    readonly code: string,
    message: string,
    readonly opts: ErrorOpts = {},
  ) {
    super(message, { cause: opts.cause });
    this.name = this.constructor.name;
  }

  serialize() {
    return {
      code: this.code,
      message: this.message,
      hint: this.opts.hint,
      docsUrl: this.opts.docsUrl,
    };
  }
}
```

### Subclasses

One subclass per namespace. They exist so callers can `instanceof`-check by namespace and so codes group together visually.

```ts
export class AuthError extends AppError {}        // AUTH_REQUIRED, AUTH_FORBIDDEN, AUTH_EXPIRED
export class ValidationError extends AppError {}  // VALIDATION_ERROR, VALIDATION_RANGE, ...
export class NotFoundError extends AppError {}    // NOT_FOUND, NOT_FOUND_AFTER_DELETE
export class ConflictError extends AppError {}    // CONFLICT_VERSION, CONFLICT_LOCKED
export class RateLimitError extends AppError {}   // RATE_LIMIT_EXCEEDED, RATE_LIMIT_BLOCKED
export class BillingError extends AppError {}     // BILLING_PAYLOAD_INVALID, BILLING_PROVIDER_DOWN
export class SecurityError extends AppError {}    // SECURITY_EGRESS_DENIED, SECURITY_FIREWALL_BLOCK
```

Subclasses **do not add methods**. They exist for type discrimination. Adding behavior makes them harder for agents to reason about.

### Codes

One file. Append-only.

```ts
// packages/core/src/errors/codes.ts
export const ERROR_CODES = {
  // auth
  AUTH_REQUIRED: "AUTH_REQUIRED",
  AUTH_FORBIDDEN: "AUTH_FORBIDDEN",
  AUTH_EXPIRED: "AUTH_EXPIRED",
  // validation
  VALIDATION_ERROR: "VALIDATION_ERROR",
  // existence
  NOT_FOUND: "NOT_FOUND",
  // conflict
  CONFLICT_VERSION: "CONFLICT_VERSION",
  // dispatcher synthetics
  METHOD_NOT_FOUND: "METHOD_NOT_FOUND",
  HANDLER_NOT_BOUND: "HANDLER_NOT_BOUND",
  HANDLER_THREW: "HANDLER_THREW",
  // ...
} as const;

export type ErrorCode = (typeof ERROR_CODES)[keyof typeof ERROR_CODES];
```

Rules:

- Format: a namespace prefix, an underscore, then the specific reason (for example `AUTH_REQUIRED`); upper-case, underscore-separated, ASCII.
- Append-only. **Never rename**; deprecate and add a new code.
- One source file. If you need categorization, use comments and grouping. Do not split across files.
- Every new code needs a one-line entry in the relevant Markdown file under `docs/errors/` with: cause, hint, recovery, link to relevant ADR if any.
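
The format rule above lends itself to a small automated check. The following is a minimal sketch, not the playbook's own gate: the function name `findMalformedCodes` and the exact regular expression are assumptions chosen for illustration.

```ts
// Hypothetical shape check for an ERROR_CODES-style map.
// Assumption: a code is upper-case ASCII words joined by underscores,
// with at least one underscore (namespace + reason).
const CODE_SHAPE = /^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+$/;

function findMalformedCodes(codes: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(codes)) {
    const value = codes[key];
    // Keys and values must be identical so clients can match on either.
    if (key !== value) {
      problems.push(`${key}: value "${value}" must equal the key`);
    }
    if (!CODE_SHAPE.test(key)) {
      problems.push(`${key}: expected an upper-case, underscore-separated ASCII code`);
    }
  }
  return problems;
}
```

A CI step could call this on the exported map and fail when the returned list is non-empty. It does not detect renames; that still needs a comparison against the previous revision of the file.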
### When to throw what

| Situation | Class | Code |
|---|---|---|
| Schema parse failed | `ValidationError` | `VALIDATION_ERROR` |
| Unauthenticated caller hit auth-required method | `AuthError` | `AUTH_REQUIRED` |
| Authenticated caller lacks capability | `AuthError` | `AUTH_FORBIDDEN` |
| Resource id not in storage | `NotFoundError` | `NOT_FOUND` |
| Optimistic-lock version mismatch | `ConflictError` | `CONFLICT_VERSION` |
| Egress to non-allowlisted domain | `SecurityError` | `SECURITY_EGRESS_DENIED` |
| Method exists but handler not registered | `AppError` (dispatcher) | `HANDLER_NOT_BOUND` |
| Handler threw a non-AppError | `AppError` (dispatcher) | `HANDLER_THREW` |

### Lint rules

Ban `throw new Error(` in boundary files:

```js
// .eslintrc.cjs
module.exports = {
  overrides: [
    {
      // Boundary files only; internal modules are not restricted.
      files: [
        "packages/*/src/methods/**",
        "packages/*/src/handlers/**",
        "packages/*/src/api/**",
      ],
      rules: {
        "no-restricted-syntax": ["error", {
          selector: "ThrowStatement > NewExpression[callee.name='Error']",
          message: "Throw a typed AppError subclass with a stable code instead.",
        }],
      },
    },
  ],
};
```

Escape hatch: a `// allow-raw-error:` comment followed by a justification, on the line above; a gate counts these.

### Wire serialization rules

- The wire payload contains `code` + `message` + optional `hint` + optional `docsUrl` + `requestId`.
- Never serialize `cause`. It can contain stack traces, file paths, env values, or secrets.
- Log the `cause` server-side, tagged with the `requestId`, so support / on-call can correlate.
- Intl-resolve the `message` at the boundary if the caller is a UI surface; do not assume the client speaks English.

### Tests

Each method's contract test (per [`contracts-zod-pattern.md`](./contracts-zod-pattern.md)) covers:

- Happy path: valid params, valid result.
- Reject path: invalid params produce a `ValidationError` with code `VALIDATION_ERROR`.
- Auth path: missing `principalId` produces `AUTH_REQUIRED`.
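
The three paths above can be sketched as a framework-free contract test. Everything here is illustrative: the classes are minimal stand-ins for the real `AppError` hierarchy, and `renameWorkspace` and `codeOf` are hypothetical names, not part of the playbook.

```ts
// Minimal stand-ins so the sketch runs on its own; real code imports these.
class AppError extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
  }
}
class ValidationError extends AppError {}
class AuthError extends AppError {}

type Ctx = { principalId?: string };

// Hypothetical method under test: auth check first, then param validation.
function renameWorkspace(ctx: Ctx, params: { name?: unknown }): { name: string } {
  if (!ctx.principalId) {
    throw new AuthError("AUTH_REQUIRED", "Sign in first.");
  }
  if (typeof params.name !== "string" || params.name.length === 0) {
    throw new ValidationError("VALIDATION_ERROR", "name must be a non-empty string.");
  }
  return { name: params.name };
}

// Returns the code of the AppError thrown by fn, or undefined if it did not throw.
// Non-AppError values are rethrown so they fail the test loudly.
function codeOf(fn: () => unknown): string | undefined {
  try {
    fn();
  } catch (err) {
    if (err instanceof AppError) return err.code;
    throw err;
  }
  return undefined;
}
```

Asserting on the value returned by `codeOf` rather than on `.message` keeps the test stable when messages are reworded or localized.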

Plus, every error code is exercised somewhere in the test suite — a separate gate scans tests for `code: "<CODE>"` assertions and fails if any code in `ERROR_CODES` is never asserted.

### Common failure modes (sourced from production)

- **Agent throws `new Error("not authorized")`.** Client cannot pattern-match. → Lint blocks raw `Error` in boundary files.
- **Agent renames a code from `AUTH_FORBIDDEN` to `FORBIDDEN`.** Existing clients stop matching. → Codes are append-only; renames require an RFC + a deprecation cycle.
- **Codes drift in naming convention.** Some `AUTH_REQUIRED`, some `AuthRequired`. → One source file + a gate that asserts shape.
- **Stack trace in `error.data` over the wire.** Leaks `/Users/<user>/.env` and the cwd. → Strip `cause` at serialization; log it server-side instead.
- **Error message changes break a client assertion.** Tests assert on `.message` instead of `.code`. → Tests assert on `.code`; messages are intl-resolved and may change.

### See also

- [`contracts-zod-pattern.md`](./contracts-zod-pattern.md) — the dispatcher serializes these.
- [`../security/README.md`](../security/README.md) — audit-ledger entries reference these codes.
- [`../../templates/ADR.template.md`](../../templates/ADR.template.md) — error-namespace renames go through an ADR.

==== https://playbook.agentskit.io/docs/pillars/architecture/event-streaming-pattern
---
title: 'Event Streaming Pattern'
description: 'How to design async, decoupled communication via queues, pub/sub, and event streams — without losing events, double-processing, or stalling consumers.'
---

# Event Streaming Pattern

How to design async, decoupled communication via queues, pub/sub, and event streams — without losing events, double-processing, or stalling consumers.

## TL;DR (human)

Three primitives: **queues** (work-distribution; one consumer per message), **pub/sub topics** (broadcast; many independent consumers), **event streams** (durable log; replayable; ordered per partition). Choose by use case.
Discipline: idempotency on every consumer, dead-letter queue, schema evolution, replay tooling, backpressure handling. The hardest mistakes are subtle — re-delivery semantics, ordering guarantees, exactly-once myths.

## For agents

### Three primitives

| Primitive | Semantics | Use case | Examples |
|---|---|---|---|
| **Queue** | At-least-once; one consumer per msg; FIFO or fair | Work distribution (jobs) | SQS, RabbitMQ, BullMQ |
| **Pub/sub topic** | At-least-once; fanout to N subscribers; usually no ordering | Notify many independent consumers | SNS, Cloud Pub/Sub, Redis pub/sub (at-most-once) |
| **Event stream** | Durable log; ordered per partition; replayable | Event-sourced systems; analytics pipelines | Kafka, Kinesis, NATS JetStream, Redpanda |

Mixing primitives is normal (queue + topic + stream). Picking the wrong one is costly.

### Delivery semantics — the truth

Three theoretical options:

- **At-most-once**: messages may be lost; never duplicated.
- **At-least-once**: messages always delivered; may be duplicated.
- **Exactly-once**: each message processed exactly once.

In practice:

- Most production systems are **at-least-once**.
- "Exactly-once" usually means **at-least-once + idempotent consumer**.
- True end-to-end exactly-once exists in some systems (Kafka transactions + transactional sinks) but is expensive and narrow.

**Design for at-least-once + idempotency.** It is the most cost-effective and most robust pattern.

### Idempotency — non-negotiable

Every consumer must handle duplicate delivery. Pattern:

```ts
async function handler(msg: Message) {
  const idempotencyKey = msg.headers["x-idempotency-key"] ?? msg.id;

  // Has this been processed?
  const existing = await db.processedMessages.findUnique({ where: { idempotencyKey } });
  if (existing) {
    logger.info("duplicate.skipped", { idempotencyKey });
    return existing.result;
  }

  // Process atomically with idempotency record.
  // Assumes a unique index on idempotencyKey, so a concurrent duplicate
  // fails on create and its transaction rolls back.
  return await db.transaction(async (tx) => {
    const result = await doWork(msg, tx);
    await tx.processedMessages.create({ data: { idempotencyKey, result } });
    return result;
  });
}
```

The idempotency record and the side effect commit in **one transaction**, so a failure leaves no half-state and the message can be retried safely.

Where transactions cross stores (e.g. external API + local DB), apply the outbox pattern (see [`distributed-data-pattern.md`](./distributed-data-pattern.md)).

### Ordering guarantees

| Primitive | Ordering |
|---|---|
| Standard SQS | No order guaranteed |
| SQS FIFO | Per message-group ordered |
| Kafka / Kinesis | Per partition ordered |
| Redis Streams | Per stream ordered |
| RabbitMQ classic queues | Per queue ordered (but with consumer caveats) |

The shard / partition key chooses ordering scope. Common choice: tenant id (events per tenant ordered; cross-tenant unordered).

If the consumer needs global order, you have one partition; you have one consumer's throughput; you have a bottleneck. Avoid.

### Dead-letter queue (DLQ)

Messages that fail repeatedly route to DLQ:

- After N retries (e.g. 5).
- Or after specific terminal errors (validation failure, missing entity).

DLQ is **inspected** — manually or via tooling. Each DLQ message is a bug:

- The message is malformed (producer bug).
- The handler has a regression (consumer bug).
- An upstream dependency is permanently down (deeper issue).

Never silently delete DLQ messages. Inspect; fix; replay.

### Backpressure

When consumers are slow relative to producers, the queue grows. Options:

- **Auto-scale consumers**: more workers; faster drain. Bounded by downstream capacity (DB, external APIs).
- **Producer back-pressure**: producers slow down on queue-depth signal. Hard to retrofit.
- **Drop oldest** (queue-depth cap): some workloads tolerate it (notifications). Most don't.
- **Spillover**: route to slower / cheaper storage at queue-depth threshold.

The wrong answer is to silently fall behind.
Set alerts on queue depth + age of oldest message.

### Schema evolution

Producers + consumers deploy independently. Their schemas must coexist across versions.

Rules:

- **Add fields**: new fields are optional with defaults. Old consumers ignore.
- **Rename fields**: requires deprecation cycle (per [`api-versioning-pattern.md`](./api-versioning-pattern.md)) — keep both names during transition.
- **Remove fields**: requires guarantee no consumer reads them. Audit; deprecate; remove.
- **Type changes**: breaking; new event name preferred.

A **schema registry** (Confluent Schema Registry, Glue Schema Registry, in-house) enforces compatibility:

- Producer registers schema at publish time.
- Compatibility check: would old consumers parse this?
- Reject incompatible schemas at publish.

Without a registry, schema drift produces consumer crashes that are hard to diagnose.

### Replay + reprocessing

For event streams (durable):

- **Replay from offset**: rewind a consumer; reprocess from N.
- **Replay to dev**: snapshot prod stream; replay locally.
- **Backfill**: a new consumer joins; processes the full history.

For queues (non-durable):

- Replay = manually re-publishing from logs / archive.

Tooling discipline: a replay command exists; it's safe; it's tested.

### Idempotency across replays

Replaying produces duplicates. The same idempotency-key pattern handles it — as long as the idempotency-keys are stable across runs.

Counter-example: `idempotencyKey = uuid()` generated at processing time → every replay produces a "new" message. Stable keys are essential.

### Event sourcing — the heavyweight pattern

Some systems persist *only* the event stream; current state is a projection.

Pros: full audit trail; replay rebuilds state; new projections retroactively serve new use cases.

Cons: every query goes through projections; schema evolution is hard; migrations are replays.

**Adopt event sourcing deliberately**, not by accident. It is a significant architecture commitment.
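
To make "current state is a projection" concrete, here is a minimal sketch that folds an event list into state. The event names, the `AccountState` shape, and the `project` function are all hypothetical; a real system would also persist snapshots and handle schema versions.

```ts
// Hypothetical event types for a single account stream.
type AccountEvent =
  | { type: "Deposited"; amount: number }
  | { type: "Withdrawn"; amount: number };

type AccountState = { balance: number; version: number };

// A projection is a pure fold over the ordered event log:
// replaying the same events always rebuilds the same state.
function project(events: AccountEvent[]): AccountState {
  let state: AccountState = { balance: 0, version: 0 };
  for (const event of events) {
    const delta = event.type === "Deposited" ? event.amount : -event.amount;
    state = { balance: state.balance + delta, version: state.version + 1 };
  }
  return state;
}
```

Because the fold is pure, a new read model is just another function over the same log, which is where both the replay benefit and the query cost listed above come from.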

For most products: regular CRUD with an outbox of domain events is the right balance. Reserve full event sourcing for systems where the event history IS the value (audit, financial systems, multi-step workflows).

### CQRS — the companion pattern

Command Query Responsibility Segregation: write models and read models differ.

- Commands go to one shape (often the event stream).
- Queries hit one or more projections optimised for the query shape.

Useful when:

- Read and write loads are very different.
- Multiple read projections benefit from the same write events.
- Eventual consistency is acceptable for reads.

Overhead: two models to maintain; eventual consistency to communicate.

### Event-driven UX

When the user triggers an action that goes async:

- **Optimistic UI**: show success immediately; reconcile on event-back.
- **Status surface**: explicit "running…" / "completed" / "failed".
- **Idempotent retries**: user clicks twice; second click finds the in-flight job.

Don't hide async-ness from the user; surface it.

### Cost concerns

Event-streaming infrastructure costs:

- **Per-message**: SQS, SNS price per million.
- **Per-throughput**: Kafka, Kinesis charge for provisioned throughput.
- **Per-storage**: retention beyond 7 days = paid storage.

Tuning:

- Batch publishes where latency allows.
- Compress payloads (Snappy, gzip).
- Tune retention to actual replay window.
- Per-tenant tagging for attribution (per [`../quality/cost-optimization-pattern.md`](../quality/cost-optimization-pattern.md)).

### Anti-patterns

- **Synchronous-over-async**: producer blocks waiting for consumer ack. Defeats decoupling.
- **Event names that encode internal state**: `UserRowVersion3UpdatedColumnX`. Producers leak DB structure to consumers.
- **Fat events**: 100 KB payloads. Consumers parse the whole world. → Small events + reference to canonical store.
- **Anaemic events**: just an id. Consumers re-fetch everything. → Include enough for common consumers.
- **Topic per consumer**: defeats decoupling. → One topic; many consumers.
- **No DLQ**: failing messages retry forever; queue grows; outage. → DLQ + alerts.
- **No idempotency**: duplicates produce double-charges, double-emails. → Idempotency-key everywhere.

### Common operational failures

- **Consumer lag spike** → backpressure; investigate downstream.
- **DLQ filling** → consumer regression; inspect first message.
- **Schema deploy breaks consumers** → registry was bypassed; rollback; enforce registry.
- **Replay duplicated work** → idempotency-key not stable.
- **Lost messages** → at-most-once setting; switch to at-least-once + ack.
- **Out-of-order in supposedly-ordered partition** → consumer-side concurrency violates ordering; serialize.

### Tooling stack (typical)

| Primitive | Tool |
|---|---|
| Managed queue | AWS SQS, GCP Cloud Tasks, Azure Service Bus |
| Self-hosted queue | RabbitMQ, BullMQ (Redis-backed) |
| Pub/sub | AWS SNS, GCP Pub/Sub, NATS |
| Event stream | Kafka, Confluent Cloud, Redpanda, AWS Kinesis, Azure Event Hubs |
| Schema registry | Confluent SR, AWS Glue SR, in-house JSON schema repo |
| Workflow engine | Temporal, AWS Step Functions, Inngest |
| Job scheduler | BullMQ, Sidekiq, Celery |

### See also

- [`distributed-data-pattern.md`](./distributed-data-pattern.md) — outbox pattern feeds event streams.
- [`api-versioning-pattern.md`](./api-versioning-pattern.md) — schema evolution rules.
- [`anti-overengineering.md`](./anti-overengineering.md) — event sourcing is the canonical premature complexity.
- [`../quality/observability-pattern.md`](../quality/observability-pattern.md) — queue depth, lag, DLQ size are key metrics.
- [`../quality/cost-optimization-pattern.md`](../quality/cost-optimization-pattern.md) — event-stream cost.
- [`../security/audit-ledger-pattern.md`](../security/audit-ledger-pattern.md) — append-only ledger is a specialized event stream.

==== https://playbook.agentskit.io/docs/pillars/architecture/feature-flags-pattern
---
title: 'Feature Flags Pattern'
description: 'How to ship code separately from shipping behavior, without accumulating a flag graveyard.'
---

# Feature Flags Pattern

How to ship code separately from shipping behavior, without accumulating a flag graveyard.

## TL;DR (human)

Feature flags decouple deploy from release. Code lands in main behind a flag, default off. Flags are typed: release / experiment / ops / kill-switch / permission. Every flag has an owner + retirement date. Flags retire as zealously as they are added — stale flags accumulate complexity worse than the alternative they were meant to avoid.

## For agents

### Flag taxonomy

Five types. Each has different lifecycle and ownership.

| Type | Purpose | Lifetime | Owner |
|---|---|---|---|
| **Release** | Hide unfinished features; flip when ready | Days–weeks | Feature team |
| **Experiment** | A/B test variants | Days–weeks (until decision) | Product / experiment owner |
| **Operational** | Ramp / canary / kill expensive code paths | Permanent (but value changes) | Ops / SRE |
| **Kill-switch** | Emergency disable of a feature in prod | Permanent | Ops / SRE |
| **Permission / entitlement** | Per-tenant / per-plan feature gating | Permanent | Product |

Confusing one type for another is the #1 source of flag debt.

### Flag definition shape

```ts
type FeatureFlag = {
  key: string;                 // "users.invite-flow.v2"
  type: "release" | "experiment" | "operational" | "kill-switch" | "permission";
  description: string;         // What does flipping it do?
  owner: string;               // Single accountable person / team.
  defaultValue: boolean | string | number;
  createdAt: string;           // ISO date.
  retireAt?: string;           // ISO date. REQUIRED for release / experiment types.
  rollout?: {
    workspaceIds?: string[];   // explicit allowlist
    percentage?: number;       // 0..100 for gradual ramp
    rules?: Array<{ attr: string; op: string; value: unknown }>; // attribute-based
  };
};
```

`retireAt` is mandatory for release and experiment flags. The flag definition is rejected if missing.

### Reading a flag

```ts
const enabled = flags.evaluate("users.invite-flow.v2", ctx);
if (enabled) {
  // new path
} else {
  // old path
}
```

The `evaluate` call:

- Reads workspace / user attributes from `ctx` (never from request body — see security Rule 2).
- Applies rollout rules in order: kill-switch override → permission gate → operational override → experiment assignment → release flag.
- Caches per-`(flag, ctx)` for the duration of the request.
- Logs the evaluation (sampled) for analytics.

### Naming conventions

Keys are dot-separated segments, from area to feature to an optional qualifier:

- `users.invite-flow.v2` (release)
- `billing.pricing-table.experiment-q3` (experiment)
- `runtime.flow-execution.parallel-handlers` (operational)
- `payments.charge.kill-switch` (kill-switch)
- `tenants.custom-domain` (permission)

Discipline: no `enable_X` / `feature_Y` / `useNewX` — those drift.

### Flag retirement

Retirement is mandatory and tracked. Sequence:

1. **Pick the winner.** For release: the new path. For experiment: whichever variant won.
2. **Make it the default in code.** Replace `if (flag) { newPath } else { oldPath }` with just `newPath`.
3. **Delete the loser path.** This is the point. Keeping both paths "in case" is the trap.
4. **Delete the flag definition.** From flag registry, from any docs.
5. **Audit-log the retirement.**

A retirement PR is a clean revert of the flag-introduction PR. If retirement is hard, the original PR did too much.

### Retirement enforcement

A gate scans:

- Flag definitions with `retireAt < today` → fail.
- Flag references in code where the flag definition no longer exists → fail (stale code).
- Flag definitions with no references in code for > 30 days → warn (likely abandoned).
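
The three scans above can be sketched as one pure function. This is an illustration under stated assumptions, not the playbook's gate: `checkRetirement`, `FlagDef`, and `GateFinding` are hypothetical names, the set of referenced keys is assumed to come from a separate code scanner, and the 30-day window is simplified to "currently unreferenced".

```ts
// Hypothetical, trimmed-down flag definition (only what the gate needs).
type FlagDef = { key: string; type: string; retireAt?: string };

type GateFinding = { key: string; level: "fail" | "warn"; reason: string };

function checkRetirement(
  defs: FlagDef[],
  referencedKeys: Set<string>, // flag keys found in code by a scanner (not shown)
  today: string,               // ISO date, "YYYY-MM-DD"
): GateFinding[] {
  const findings: GateFinding[] = [];
  const defined = new Set(defs.map((d) => d.key));

  for (const def of defs) {
    // ISO dates in the same format compare correctly as strings.
    if (def.retireAt !== undefined && def.retireAt < today) {
      findings.push({ key: def.key, level: "fail", reason: "past retireAt" });
    }
    // Simplification: warns as soon as a flag is unreferenced,
    // without tracking how long it has been that way.
    if (!referencedKeys.has(def.key)) {
      findings.push({ key: def.key, level: "warn", reason: "no references in code" });
    }
  }

  referencedKeys.forEach((key) => {
    if (!defined.has(key)) {
      findings.push({ key, level: "fail", reason: "referenced but not defined" });
    }
  });

  return findings;
}
```

A CI wrapper would exit non-zero when any finding has level `fail` and print the warnings without failing the build.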

This prevents flag graveyard accumulation. A flag past retirement is debt; surface it.

### Kill-switch discipline

Kill-switches are permanent flags, but they have constraints:

- **Always default ON**, kill action is "flip to off".
- **Per-tenant override allowed** (mute a noisy customer's expensive feature).
- **Documented runbook**: when to flip, what user-visible effect, expected recovery time.
- **Flipping is audit-logged** with operator id + reason.

A kill-switch you cannot find when production is on fire is worse than no kill-switch.

### Storage

Flag values live in:

- **In-process default**: the flag definition's `defaultValue` (bootstrap fallback).
- **Centralized config store**: durable; per-environment + per-tenant overrides.
- **Edge / runtime override**: fast path for kill-switch flips.

Mutations are audit-logged: who flipped, when, what value, from what state.

### Experiment-specific concerns

Experiments need additional discipline:

- **Pre-registered hypothesis**: what you expect to see; what would change the call.
- **Sample size + power calculation**: how long until you have enough data.
- **Stop conditions**: when do you call it.
- **One experiment per metric per surface at a time**: parallel experiments confound results.

Treat experiments as time-boxed. An experiment past its stop date is a stale flag.

### Permission flags (per-tenant entitlements)

Permission flags differ from feature flags in semantics:

- Persistent (not retired).
- Tied to plan / contract terms.
- Visible in the product (the user can see "you don't have this on your plan").
- Linked to billing.

Implement these via plan presets in the whitelabel runtime (see [`../ui-ux/whitelabel-pattern.md`](../ui-ux/whitelabel-pattern.md)), not via the feature-flag system. Mixing the two is confusing.

### Common failure modes

- **Release flag that lives forever.** Code has both paths permanently. → Mandatory `retireAt`; retirement gate.
- **Experiment with no stop condition.** Runs forever; nobody calls it. → Pre-registered stop conditions.
- **Permission flag in feature-flag system.** Retirement gate flags it; team adds bogus `retireAt` to silence the gate. → Separate system; clear semantics.
- **Flag value read at module-import time.** Cannot change without restart. → Always evaluate per request / per call site.
- **Flag evaluation in security-critical paths without falling back to safe default.** Network blip → flag returns undefined → behavior is wrong. → Default-safe values; circuit-break on store failure.
- **Branching deep inside a function on a flag.** Function does two things; tests have to mock the flag. → Branch at the call site; pass the chosen function down.

### See also

- [`anti-overengineering.md`](./anti-overengineering.md) — flags should not be the default; YAGNI applies.
- [`../security/universal.md`](../security/universal.md) — flag flips audit-logged.
- [`../ui-ux/whitelabel-pattern.md`](../ui-ux/whitelabel-pattern.md) — permission flags via plan presets.
- [`../quality/quality-gates-pattern.md`](../quality/quality-gates-pattern.md) — retirement gate.

==== https://playbook.agentskit.io/docs/pillars/architecture/file-size-budget
---
title: 'File-size Budget'
description: 'How to keep files reviewable when agents write most of the code.'
---

# File-size Budget

How to keep files reviewable when agents write most of the code.

## TL;DR (human)

Agents will happily produce 1500-line files. Reviewers cannot read 1500-line files. A per-extension line budget, enforced as a CI gate with a shrink-only baseline, forces extractions at the right moment — when the file is still small enough to split cleanly.

## For agents

### Budgets (calibrated)

| Extension | Budget (lines) | Notes |
|---|---|---|
| `.tsx` (React components / screens) | 300 | Forces sub-component extraction at the right granularity |
| `.ts` (logic modules) | 500 | Enough for a real module; small enough to scan |
| `.test.ts(x)` | 800 | Table-driven tests legitimately get long |
| `.md` | none | Docs are linear; agents don't get confused inside long docs |
| `.json` (data) | none | Generated / data files exempt |

Adjust per language: Go and Rust modules legitimately run larger (target 800 / 1000); Python target 500.

Measure: physical lines (`wc -l`), not "non-blank non-comment". Blank lines and JSDoc are part of readability; counting them keeps the budget honest.

### The gate

Mode: **shrink-only baseline**.

1. Generate a baseline JSON listing every file currently over budget with its current line count.
2. On every CI run, recompute. For each baselined file: fail if it grew. For each file not in baseline: fail if it exceeds budget.
3. Baseline regenerates only on intentional sweeps (a PR that removes entries explicitly).

Why shrink-only: prevents adoption from blocking the whole repo on day one; prevents drift from making it worse.

Reference impl: [`../../scripts/check-file-size.example.mjs`](../../scripts/check-file-size.example.mjs). The baseline file lives at `.file-size-baseline.json` in the repo root.

### Extraction patterns

When the gate fires, do not lower the budget. Do not split into `<name>-2.tsx`. Extract intentionally:

**React component over 300 lines** → identify sub-renders:

```tsx
// Before: dashboard.tsx — 420 lines
export function Dashboard() {
  // ... 100 lines of state
  // ... 90 lines of header JSX
  // ... 120 lines of body JSX
  // ... 80 lines of footer JSX
  // ... 30 lines of handlers
}

// After: dashboard.tsx — 90 lines
import { DashboardHeader } from "./parts/dashboard-header";
import { DashboardBody } from "./parts/dashboard-body";
import { DashboardFooter } from "./parts/dashboard-footer";
import { useDashboardState } from "./use-dashboard-state";

export function Dashboard() {
  const state = useDashboardState();
  // Assumed prop shape, for illustration: each part receives the shared state.
  return (
    <>
      <DashboardHeader state={state} />
      <DashboardBody state={state} />
      <DashboardFooter state={state} />
    </>
  );
}
```

The `parts/` convention is enforced: extractions go in a sibling `parts/` directory, not a top-level `components/`. Keeps the screen's surface area local.

**Logic module over 500 lines** → identify cohesive responsibilities:

- One file per public function family. If a 500-line file has CRUD for two unrelated entities, split by entity.
- Helpers move to `<name>-helpers.ts`; types to `<name>-types.ts`.

### Gate ergonomics

To prevent agents grinding against the budget:

- The error message says **exactly which file is over budget, by how much, and what the budget is**. Not "size check failed".
- The pre-commit hook runs the gate so the agent learns at commit time, not at push time.
- The gate has a flag `--explain` that prints the recommended extraction pattern.

### Common failure modes (sourced from production)

- **Agent inlines a multi-line ternary just under budget instead of refactoring.** File passes but is now harder to read. → Pair the size gate with a separate "no nested ternary" lint and a complexity-per-function gate (max cyclomatic ~14).
- **Agent renames the file to dodge the baseline.** New name is under budget by accident; baseline forgets the old one. → Gate hashes file content; an unchanged content under a new name still counts.
- **Agent splits the file into `dashboard.tsx` + `dashboard-2.tsx`.** Worse than the original. → Lint bans `-N.tsx` numeric suffixes.
- **Baseline grows over time because no one shrinks it.** → Set an explicit "baseline shrink-only" rule plus a separate per-quarter goal: pick the top-N largest baselined files and extract.
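
The shrink-only rule described under "The gate" can be sketched as a pure comparison of current line counts against the baseline. This is an illustrative sketch, not the reference script: `checkFileSizes` and `budgetFor` are hypothetical names, and counting lines and loading the baseline file are left to the caller.

```ts
type LineCounts = Record<string, number>; // path -> physical line count

// Budgets from the table above; undefined means the file is exempt.
function budgetFor(path: string): number | undefined {
  if (/\.test\.tsx?$/.test(path)) return 800;
  if (/\.tsx$/.test(path)) return 300;
  if (/\.ts$/.test(path)) return 500;
  return undefined;
}

function checkFileSizes(current: LineCounts, baseline: LineCounts): string[] {
  const errors: string[] = [];
  for (const path of Object.keys(current)) {
    const lines = current[path];
    const budget = budgetFor(path);
    if (budget === undefined || lines <= budget) continue;

    const allowed = baseline[path];
    if (allowed === undefined) {
      // Not baselined: the budget applies in full.
      errors.push(`${path}: ${lines} lines, budget ${budget} (over by ${lines - budget})`);
    } else if (lines > allowed) {
      // Baselined: may stay over budget, but must not grow.
      errors.push(`${path}: grew from ${allowed} to ${lines} lines (budget ${budget})`);
    }
  }
  return errors;
}
```

Each message names the file, the count, and the budget, in line with the ergonomics rule that the gate should never report a bare "size check failed".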

### Calibration

If the budget is too tight, agents waste cycles. If too loose, files get unreviewable. Recalibrate based on:

- Reviewer feedback: "I cannot read this in one sitting" → tighten.
- Extraction noise: too many `parts/parts/parts/` chains → loosen, or revisit the design — the component is doing too much.
- Test files growing: legitimately table-driven; 800 is permissive on purpose.

### See also

- [`universal.md`](./universal.md) — Rule 6 (file-size budget).
- [`../quality/README.md`](../quality/README.md) — gate wiring.
- [`../../scripts/check-file-size.example.mjs`](../../scripts/check-file-size.example.mjs) — reference impl.

==== https://playbook.agentskit.io/docs/pillars/architecture/iac-pattern
---
title: 'Infrastructure as Code (IaC) Pattern'
description: 'How to define + version + review + apply infrastructure as code, instead of clicking around cloud consoles.'
---

# Infrastructure as Code (IaC) Pattern

How to define + version + review + apply infrastructure as code, instead of clicking around cloud consoles.

## TL;DR (human)

Every cloud resource is described in code (Terraform / Pulumi / CDK / CloudFormation). Code lives in git, reviewed via PR, applied via CI. Drift detection alerts when reality differs from code. Modules abstract reusable patterns. State management is the operational hard problem; secure + back it up.

## For agents

### Why IaC

- **Reviewable**: changes visible as PRs.
- **Reproducible**: spin up identical environments.
- **Versioned**: history of every change.
- **Auditable**: who applied what when.
- **Testable**: validate before apply.
- **Reusable**: modules across teams + environments.

Clicking in the cloud console:

- Unreviewable.
- Reality drifts.
- Disaster recovery from scratch = days.
- No history.
### Tooling

| Tool | Style | Notes |
|---|---|---|
| **Terraform / OpenTofu** | Declarative HCL; multi-cloud | Most popular; large module ecosystem |
| **Pulumi** | Code (TS/Python/Go); declarative | Multi-cloud; type-safe |
| **AWS CDK** | TypeScript / Python; generates CloudFormation | AWS-only; deep integration |
| **CloudFormation** | YAML/JSON | AWS-native; verbose |
| **Bicep** | Microsoft-native DSL | Azure-only |
| **Crossplane** | k8s-native; provisions cloud resources via CRDs | k8s-first orgs |

Most teams: Terraform / OpenTofu (or Pulumi). Aim for one IaC tool across the org.

### Code structure

```
infra/
├── modules/              # reusable building blocks
│   ├── vpc/
│   ├── eks-cluster/
│   ├── rds-postgres/
│   └── service/          # generic service module
├── envs/                 # per-environment config
│   ├── dev/
│   │   └── main.tf       # imports modules with dev values
│   ├── staging/
│   └── prod/
└── README.md
```

Modules are reusable; envs compose modules with environment-specific values.

### Module discipline

A module:

- Has clear inputs (variables) + outputs.
- Has versioning (tagged in git or registry).
- Has README + example usage.
- Has tests (Terratest, Pulumi unit tests).
- Has minimal blast radius (a "VPC module" doesn't reach into compute).

Cross-cutting modules (security groups, IAM roles, observability defaults) capture conventions per [`anti-overengineering.md`](./anti-overengineering.md): write once, reuse N+ times.

### State management

Terraform state is a JSON file describing what's deployed. Treat it as critical:

- **Remote backend** (S3 + DynamoDB lock, GCS, Terraform Cloud, Pulumi service): never local-only in prod.
- **Encryption at rest**: backend-level.
- **Locking**: prevents concurrent applies corrupting state.
- **Backup**: state file backup retention; recover from corruption.
- **Per-env state**: dev / staging / prod separate; blast-radius bounded.

State drift (reality differs from state) is the common operational issue. Run `terraform plan` regularly to detect it.
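One way to turn "plan regularly" into an alert is to classify the exit code of `terraform plan -detailed-exitcode`, which returns 0 for no changes, 1 for an error, and 2 when changes are present. The sketch below assumes a scheduled job has already run the plan per environment and collected the exit codes; `classifyPlanExit`, `driftReport`, and the `EnvResult` shape are hypothetical names for illustration.

```typescript
type PlanStatus = "in-sync" | "drift" | "error";

// Exit codes for `terraform plan -detailed-exitcode`:
// 0 = no changes, 1 = error, 2 = changes present.
function classifyPlanExit(exitCode: number): PlanStatus {
  if (exitCode === 0) return "in-sync";
  if (exitCode === 2) return "drift";
  return "error";
}

// Hypothetical shape: one entry per environment, matching per-env state.
type EnvResult = { env: string; exitCode: number };

// Keep only the environments that need attention, so the alert names them.
function driftReport(results: EnvResult[]): string[] {
  return results
    .map((r) => ({ env: r.env, status: classifyPlanExit(r.exitCode) }))
    .filter((r) => r.status !== "in-sync")
    .map((r) => `${r.env}: ${r.status}`);
}
```

Treating a plan error as its own status matters here: a scheduled job that cannot read state should alert, not pass silently as "no drift".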
### Workflow ``` 1. Engineer writes / changes Terraform code. 2. PR opens. 3. CI runs `terraform plan` against target env; comments plan on PR. 4. Reviewer reads the plan; approves changes. 5. PR merges. 6. CI runs `terraform apply` (or manual approval gate). 7. State updates; resources change. 8. Smoke checks confirm health. ``` Direct `terraform apply` outside the pipeline = bypass; treat as anti-pattern. ### Plan as the review artifact `terraform plan` outputs: - What will be created. - What will be updated. - What will be **destroyed** (critical to spot). The plan is the reviewable thing. A PR with a plan showing "destroy production database" is caught here. CI posts the plan to the PR; reviewer reads. ### Drift detection Reality vs state: - Someone clicked in the console. - A different tool made changes. - An incident response did emergency changes that weren't codified. Detection: - Scheduled `terraform plan` (CI cron); alerts on diff. - Cloud-native drift detection (AWS Config, CloudFormation drift, Pulumi drift detection). - Manual review periodically. When drift detected: either codify the change (write the IaC) or revert. Don't ignore. ### Secrets in IaC Don't commit secrets to IaC. Patterns: - **Variable references to vault**: IaC sets the connection to vault; secret values flow at runtime. - **SOPS-encrypted variable files**: works for GitOps where the deploy operator has the decrypt key. - **External Secrets Operator**: k8s case; IaC creates the reference, ESO fetches the value. Common mistake: secrets in Terraform state. State files contain plaintext of all values; readable by anyone with state access. ### Modular vs monolith repo Two patterns: - **Mono-IaC repo**: all infra in one repo; cross-team coordination. - **Per-service infra**: each service repo includes its infra; smaller scope. Mono-IaC for early-stage; per-service for mature with mature platform team. 
### Testing IaC Yes, IaC needs tests: - **Validate** (`terraform validate`): syntax + provider rules. - **Format** (`terraform fmt -check`). - **Lint** (tflint, checkov): security + best-practice rules. - **Plan**: a successful plan IS a kind of test. - **Integration**: Terratest, Pulumi unit tests — create real infra in a sandbox; assert; tear down. - **Policy as code** (Sentinel, OPA, Checkov): "this PR cannot create a public S3 bucket without explicit exception". ### Cost forecasting Before apply: - Infracost analyses the plan; estimates monthly cost. - Posts to PR. - Reviewer sees "this PR adds $1200/mo". Cost-aware reviews catch over-provisioning before it ships. ### Disaster recovery via IaC A region failure → re-create infra in another region via the same IaC. If you can't, IaC isn't actually capturing the system. Drill: tear down a non-prod env; re-create via IaC; confirm functional. Quarterly. ### GitOps GitOps is IaC for runtime config (k8s manifests; not just cloud resources): - Manifests in git. - A controller (Flux, Argo CD) reconciles cluster state to git state. - Changes go through PR. GitOps for k8s; IaC for cloud resources. Often combined: IaC creates the cluster; GitOps manages workloads on it. ### Common failure modes - **Click-ops alongside IaC**. Drift constant; IaC distrusted. → Console access restricted; ops via IaC. - **State in local file**. Lost laptop = lost knowledge of what exists. → Remote backend. - **Secrets in state**. Visible to anyone with state read. → Vault / external; no secrets in IaC. - **`destroy` in plan unnoticed**. PR merged; prod DB gone. → Reviewers read the plan; tag PR with destroy warning. - **No drift detection**. Reality drifts; IaC doesn't reflect; future apply destroys reality. → Scheduled plan. - **`apply` outside CI**. Bypass review. → Cloud IAM prevents direct apply. - **One mega-module**. Hard to test; hard to reuse. → Small composable modules. 
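For the "`destroy` in plan unnoticed" failure mode above, a CI step could read the machine-readable plan (the output of `terraform show -json` on a saved plan file) and list every resource whose actions include `delete`, so the PR can be tagged automatically. This is a minimal sketch: it models only the `resource_changes[].address` and `resource_changes[].change.actions` fields of that JSON, and `destructiveChanges` is a hypothetical helper name.

```typescript
// Subset of the Terraform plan JSON that this sketch relies on.
type ResourceChange = {
  address: string;
  change: { actions: string[] };
};

type PlanJson = { resource_changes?: ResourceChange[] };

// A replacement shows up as ["delete", "create"] or ["create", "delete"],
// so checking for "delete" catches both plain destroys and replacements.
function destructiveChanges(plan: PlanJson): string[] {
  return (plan.resource_changes ?? [])
    .filter((rc) => rc.change.actions.some((action) => action === "delete"))
    .map((rc) => rc.address);
}
```

A non-empty result would be the trigger for the destroy warning on the PR; the reviewer still reads the full plan.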
### Tooling stack (typical) | Concern | Tool | |---|---| | Core IaC | Terraform / OpenTofu, Pulumi, AWS CDK | | Modules registry | Terraform Registry, private registries | | Policy | Sentinel (TF Cloud), OPA, Checkov, tfsec | | Cost forecast | Infracost | | Drift detection | Driftctl, CloudQuery, Atlantis | | GitOps (k8s) | Flux, Argo CD | | Test | Terratest, Pulumi Test, Pester (PowerShell) | | Lint | tflint, checkov, tfsec | | Backend | S3 + DynamoDB, GCS, Terraform Cloud, Pulumi service | ### Adoption path 1. **Day 0**: new resources go through IaC. Existing manually-created: import incrementally. 2. **Month 1**: per-env state; CI plan-on-PR. 3. **Month 2**: cost forecasting; policy gates. 4. **Quarter 1**: modules for reusable patterns; tests for them. 5. **Quarter 2+**: DR drill; drift detection; GitOps for k8s. ### See also - [`platform-engineering-idp-pattern.md`](./platform-engineering-idp-pattern.md) — platform team owns shared modules. - [`multi-region-pattern.md`](./multi-region-pattern.md) — IaC enables region duplication. - [`../security/container-k8s-security-pattern.md`](../security/container-k8s-security-pattern.md) — k8s defined via IaC. - [`../security/secrets-mgmt-deep-pattern.md`](../security/secrets-mgmt-deep-pattern.md) — secrets not in IaC. - [`../quality/cost-optimization-pattern.md`](../quality/cost-optimization-pattern.md) — IaC enforces tagging. - [`../quality/ci-cd-pipeline-pattern.md`](../quality/ci-cd-pipeline-pattern.md) — IaC pipeline. ==== https://playbook.agentskit.io/docs/pillars/architecture/multi-region-pattern --- title: 'Multi-Region Pattern' description: 'How to operate across geographic regions for latency, availability, and data sovereignty — without losing your mind.' --- # Multi-Region Pattern How to operate across geographic regions for latency, availability, and data sovereignty — without losing your mind. ## TL;DR (human) Multi-region is **operationally hard** and adds permanent complexity. 
Adopt only when at least one of three reasons holds: user latency forces it (global product), availability requires it (one region cannot be a single point of failure), data sovereignty mandates it (regulations require data-in-country). Otherwise stay single-region; vertical-scale longer. ## For agents ### Three reasons to go multi-region | Reason | Symptom | Common minimum effective response | |---|---|---| | **Latency** | Users on other continents experience > 200ms RTT | CDN + edge cache for read-heavy; second region for write-heavy | | **Availability** | Single-region outage = product down; SLA in jeopardy | Active-passive with documented failover | | **Data sovereignty** | GDPR / LGPD / data-residency laws | Per-region data store; per-tenant region pinning | If none of these holds, multi-region is overhead. Revisit yearly. ### Active-passive (start here) - **One write region**, others stand by. - Asynchronous replication: writes go to primary; replicas in other regions catch up. - **Failover**: operator action; promotes a replica to primary. RTO = minutes; RPO = replication lag. - **Reads**: can route to nearest region (with replication-lag tolerance) or always to primary (for strict consistency). Pros: simple; one source of truth; well-understood failure modes. Cons: failover requires action; cross-region writes are slow (round-trip to primary); no horizontal write scaling. ### Active-active with partition leaders - Each **partition** (tenant, geographic block, customer) has a leader region. - Writes for that partition succeed only in its leader region; reads can serve elsewhere. - Failover: leader for a partition moves to another region; partition-level RTO = minutes; per-partition RPO = replication lag. Pros: no global single point of failure; horizontal write scaling. Cons: cross-partition operations are expensive (multi-region transactions); routing logic per write. 
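The partition-leader model above can be sketched as a small router. This is an illustration under assumptions, not a prescribed design: `PartitionRouter`, its method names, and the region strings are hypothetical, and the leader table is held in memory where a real system would keep it in a replicated store.

```typescript
type Region = string;

type WriteRoute =
  | { action: "apply-locally" }
  | { action: "forward"; to: Region };

class PartitionRouter {
  private localRegion: Region;
  private leaders = new Map<string, Region>();

  constructor(localRegion: Region) {
    this.localRegion = localRegion;
  }

  assignLeader(partition: string, region: Region): void {
    this.leaders.set(partition, region);
  }

  // Writes succeed only in the partition's leader region; anywhere else
  // they are forwarded to the leader.
  routeWrite(partition: string): WriteRoute {
    const leader = this.leaders.get(partition);
    if (leader === undefined) {
      throw new Error(`no leader assigned for partition ${partition}`);
    }
    return leader === this.localRegion
      ? { action: "apply-locally" }
      : { action: "forward", to: leader };
  }

  // Failover moves leadership for one partition only; others are untouched.
  failover(partition: string, to: Region): void {
    if (!this.leaders.has(partition)) {
      throw new Error(`unknown partition ${partition}`);
    }
    this.leaders.set(partition, to);
  }
}
```

Keeping the decision in one router object also fits the later advice that all region-aware code should go through a single module.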
### Active-active with conflict resolution (CRDT / multi-leader) - Any region can write any data. Conflicts merge automatically (CRDT) or are resolved by application logic. - Reads serve from nearest region. - Strong eventual consistency. Pros: zero failover time for writes; lowest user-perceived latency. Cons: conflict resolution complicates application logic; not all data shapes have natural merge functions (counters yes; strings no); expensive to retrofit. Most products do not need this tier. Adopt only after exhausting partition-leader. ### Failover discipline A failover plan is a runbook with: 1. **Triggers**: what conditions justify failover? (region down ≥ 5 min; sustained error rate > X%; manual override.) 2. **Decision authority**: who triggers? (Sometimes automated; usually human-in-loop for non-trivial systems.) 3. **Steps**: exact commands, in order, with expected duration per step. 4. **Verification**: how to confirm the failover worked. 5. **Rollback**: if the failover itself fails, how to revert. 6. **Communication**: who is told (engineering, support, customers). **Drilled quarterly**. Untested failover plans do not work when needed. ### Data residency (sovereignty) When regulations require data-in-country: - **Per-tenant region pinning**: tenant's data writes go only to their region. - **Schema includes residency tags**: each record knows where it lives. - **Egress is region-aware**: a query against tenant X never touches storage outside X's region. - **Audit logs are also region-pinned** (sometimes regulator-specific). The boundary is the database, not the application. Application-level "always filter by region" is fragile; storage-level partitioning is durable. ### Geo-DNS / global load balancing Front the system with: - **Geo-DNS**: route DNS to nearest region. - **Anycast IP**: same IP everywhere; BGP routes to nearest. - **CDN / edge**: cached responses served close to user; cache miss routes to region. 
The front layer is invisible to the application most of the time; it surfaces when a region is failing (geo-DNS / health checks should remove the failing region from rotation). ### Cross-region call discipline Every call that crosses a region boundary is **slow** (tens to hundreds of ms RTT). Discipline: - **Cache aggressively** at the consumer. - **Batch** cross-region calls. - **Avoid synchronous fanout** — N parallel calls to N regions = latency = max of all. - **Idempotency required** — cross-region calls retry on network blip; non-idempotent retries corrupt state. ### Cost Multi-region triples (or more) infrastructure cost: - 3 regions = 3 copies of every service + 3 copies of every store + cross-region replication bandwidth. - Operational cost rises: monitoring + on-call rotation per region + region-aware incident response. Budget accordingly. Multi-region is not free; the business case must justify the cost. ### Per-pillar concerns at multi-region scale **Security**: - Vault per region (sealer keys stay in their region). - Audit ledger per region; cross-region verification. - RBAC scope checks region-aware. **UI-UX**: - User-perceived latency drops dramatically (the point). - Failover during user session: UI must handle abrupt error + retry cleanly. **Quality**: - Tests cover multi-region scenarios (a partition test that pins a tenant to a region, then asserts the data does not appear in another). - Failover game days quarterly. **Governance**: - RFC any cross-region contract change (every region must agree). ### Common failure modes - **Adopting multi-region for "scale"** before single-region is exhausted. → Vertical-scale first. The single-region ceiling is high. - **Active-active write conflict.** Two writes to the same record in two regions; one is lost without anyone noticing. → CRDT or partition-leader; never silent last-writer-wins. - **Failover that has never been drilled.** Crisis day: runbook is wrong. → Quarterly drill. 
- **Cross-region call inside a hot loop.** N × M latency = user waits seconds. → Cache or restructure. - **Region-aware code mixed with non-region-aware code.** Mistakes inevitable. → All region-aware code goes through one region-router module. - **Data sovereignty enforced in app code only.** A bypass leaks data cross-region. → Enforce in storage / network policy. ### When to roll back Multi-region is sometimes a mistake. Rollback signals: - Operational cost outweighs benefit. - Engineering velocity craters because every change touches N regions. - Failovers happen rarely; when they do, they don't work. Rollback = consolidate to one region; tear down the rest in a careful migration. The cost of being multi-region wrong is high; honest evaluation matters more than sunk cost. ### See also - [`distributed-data-pattern.md`](./distributed-data-pattern.md) — sharding + replication primitives. - [`anti-overengineering.md`](./anti-overengineering.md) — multi-region is the canonical premature-flexibility trap. - [`../security/multi-tenant-isolation-pattern.md`](../security/multi-tenant-isolation-pattern.md) — tenant-aware region pinning. - [`../security/vulnerability-mgmt-pattern.md`](../security/vulnerability-mgmt-pattern.md) — patch cadence across regions. - [`../quality/observability-pattern.md`](../quality/observability-pattern.md) — per-region SLOs. ==== https://playbook.agentskit.io/docs/pillars/architecture/offline-first-sync-pattern --- title: 'Offline-First + Sync Pattern' description: 'How to design apps that work without network connectivity and reconcile state when connectivity returns.' --- # Offline-First + Sync Pattern How to design apps that work without network connectivity and reconcile state when connectivity returns. ## TL;DR (human) Offline-first means the local copy is the truth; sync to remote when possible. Designed right: instant interactions, work-anywhere, automatic reconciliation. Designed wrong: lost data, conflicting state, user confusion. 
Three problems to solve: local persistence, sync protocol, conflict resolution. CRDTs simplify the last; explicit policy works otherwise. ## For agents ### Three concerns | Concern | Question | Tools | |---|---|---| | **Local persistence** | Where does data live offline? | IndexedDB (web), SQLite (mobile), file system (desktop) | | **Sync protocol** | How do client + server reconcile? | Custom REST + diff, GraphQL subscriptions, CouchDB-style replication | | **Conflict resolution** | When local and remote disagree, who wins? | CRDT auto-merge, last-writer-wins, manual resolution UX | ### Local persistence **Web**: IndexedDB (browser-native; large quotas; structured) > localStorage (small; sync). Wrappers: Dexie, idb, RxDB. **Mobile**: SQLite (cross-platform; mature; queryable) > AsyncStorage (key-value only). Wrappers: WatermelonDB, Realm. **Desktop**: SQLite or file-based; per-platform OS-native options. The local store mirrors the server's data shape. Reads are local; writes are local-first (then sync). ### Sync protocol Sync protocol is the contract between client + server. Three styles: **Full replication**: client downloads everything for its scope. Re-sync == replace. Works for small datasets per user. **Incremental sync**: - Client tracks last-sync timestamp / version. - Server returns changes since. - Client applies; updates last-sync. Requires server-side change log (per [`event-streaming-pattern.md`](./event-streaming-pattern.md)) or efficient timestamp queries. **Operational transform / CRDT replication**: each side captures operations or CRDT updates; merge convergent regardless of order. ### Operation log on client Local writes captured as operations (not just state mutations): ```ts type LocalOp = { id: string; type: "create" | "update" | "delete"; entity: string; payload: unknown; clientTimestamp: number; status: "pending" | "synced" | "failed"; }; ``` Operations sit in a local queue. When online: replay in order to server; mark synced. 
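The replay step could look roughly like the sketch below. It repeats the `LocalOp` type so the block is self-contained, and it makes several assumptions: `send` is a hypothetical synchronous transport (a real client would be async), a rejected operation is marked `failed` without blocking later ones, and replay stops at the first `offline` result so ordering is preserved for the next attempt.

```typescript
type LocalOp = {
  id: string;
  type: "create" | "update" | "delete";
  entity: string;
  payload: unknown;
  clientTimestamp: number;
  status: "pending" | "synced" | "failed";
};

// Hypothetical transport outcome for one operation.
type SendResult = "ok" | "rejected" | "offline";

// Replay pending operations in queue order and return the updated queue.
// Once the transport reports "offline", the remaining operations stay
// pending and untouched so a later replay resumes from the same point.
function replayQueue(
  queue: LocalOp[],
  send: (op: LocalOp) => SendResult,
): LocalOp[] {
  let blocked = false;
  return queue.map((op): LocalOp => {
    if (blocked || op.status !== "pending") return op;
    const result = send(op);
    if (result === "offline") {
      blocked = true;
      return op;
    }
    return { ...op, status: result === "ok" ? "synced" : "failed" };
  });
}
```

Returning a new array rather than mutating in place keeps the persisted queue and the in-memory view from diverging if the write to local storage fails.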
Server returns canonical-state diffs.

### Conflict resolution strategies

| Strategy | How it resolves | When to use |
|---|---|---|
| **Last-writer-wins (LWW)** | Strict total order via timestamps; data class allows loss | Reasonable default for collaborative-but-not-critical |
| **First-writer-wins** | "Once set, immutable" semantics | |
| **CRDT** | Math guarantees convergence regardless of order | Counters, sets, sequences, text (Yjs, Automerge) |
| **Manual** | UI shows conflict; user picks | High-stakes data; rare |
| **Custom semantic** | Domain-specific merge | When math doesn't help |

CRDT is the cleanest if your data shape fits (counters, text editors, lists, sets). Outside those shapes: LWW + careful UX.

### Clock discipline

Conflict resolution often uses timestamps. Clock skew breaks it:

- Use server time when sync happens (server stamps).
- Hybrid Logical Clocks (HLC) for distributed correctness without strict NTP.
- Lamport / vector clocks for strict ordering.

Don't use client wall-clock alone: devices drift, and users adjust clocks manually.

### Sync state machine per record

```
local-only → syncing → synced
local-only → syncing → failed → retry
synced → modified-locally → syncing → ...
```

Each record's status determines UI behaviour:

- `local-only`: badge "Saving locally".
- `syncing`: subtle activity indicator.
- `synced`: clean.
- `failed`: error state with retry.

### Offline UX

Tell the user:

- **Network status**: visible (header banner when offline).
- **Sync progress**: subtle (post-success), explicit (when active sync is non-trivial).
- **Per-record state**: badges or icons for in-progress / failed.
- **Conflicts** (rare): explicit UX to resolve.

Never silently lose data. If a sync fails permanently, surface it.

### Authentication offline

Tricky:

- Tokens issued before going offline still work locally.
- Refresh fails offline; access token may expire mid-offline session.
- Permission changes on server don't reach client until online.
Pattern: cached permissions; long-lived offline access token; re-auth when online if expired during offline window. Step-up operations (per [`../security/session-mgmt-pattern.md`](../security/session-mgmt-pattern.md)) require online. ### Service workers (web) For PWAs: - Service worker intercepts requests. - Cache strategies: cache-first (static assets), network-first (API), stale-while-revalidate. - Background sync API: queue writes; replay when online. Service workers add complexity; only adopt when offline is a core product requirement. ### Sync at scale For each tenant in multi-tenant systems: - Sync state per (user, device). - Sync windows: tenants don't see other tenants' streams. - Large dataset users may need progressive sync (paginated by time / entity). ### Local search Local data enables local search: - IndexedDB indexes for fast lookup. - SQLite FTS5 for full-text search. - Bloom filters for membership tests in larger datasets. Search latency dominates UX; local search makes the app feel instant. ### When NOT offline-first - Single-session web tools. - Sensitive data that shouldn't sit on client devices. - Truly real-time-only (live video). - Trivial CRUD where online assumption is acceptable. Offline-first is a significant architecture commitment. Adopt when network unreliability is part of the product reality. ### Common failure modes - **Local data + no sync UI**: user wonders if their work persisted. → Per-record status. - **LWW without HLC**: clock skew = wrong winner. → HLC or server stamps. - **No conflict UX for high-stakes data**: silent loss. → Manual resolution; never silently overwrite. - **No retry policy**: failed sync stays failed. → Exponential backoff; resume on connectivity. - **Token expiry during offline**: user logged out unexpectedly. → Long offline window; clear re-auth on return. - **All data synced for every user**: storage explodes; sync slow. → Scope (tenant; recent; visible). 
- **No "i lost this" recovery path**: data gone if device wiped. → Server is canonical; local is cache; loss recoverable from server if synced. ### Tooling stack (typical) | Concern | Tool | |---|---| | Local DB (web) | IndexedDB direct, Dexie, idb, RxDB | | Local DB (mobile) | SQLite, WatermelonDB, Realm | | Sync framework | PouchDB / CouchDB, Replicache, Electric SQL, PowerSync | | CRDT | Yjs (Y.js), Automerge, Loro | | Operation queue | Custom; or sync framework's built-in | | Service worker | Workbox | ### Adoption path 1. **Day 0**: no offline support. Online-required. Document. 2. **If offline needed**: pick scope (read-only offline first; then writes; then full offline). 3. **Choose strategy**: LWW for simple; CRDT for collaborative; manual UX for high-stakes. 4. **Per-record sync state in UI**. 5. **Test**: throttled / disconnected / re-connect cycles + multi-device. 6. **Drill**: conflict scenarios; data loss recovery. ### See also - [`distributed-data-pattern.md`](./distributed-data-pattern.md) — eventual consistency. - [`event-streaming-pattern.md`](./event-streaming-pattern.md) — operation log + replay. - [`../security/session-mgmt-pattern.md`](../security/session-mgmt-pattern.md) — offline auth concerns. - [`../ui-ux/empty-states-pattern.md`](../ui-ux/empty-states-pattern.md) — offline as an empty/error state. - [`anti-overengineering.md`](./anti-overengineering.md) — adopt offline-first only with real need. ==== https://playbook.agentskit.io/docs/pillars/architecture/platform-engineering-idp-pattern --- title: 'Platform Engineering + Internal Developer Platform Pattern' description: 'How to build the layer between cloud / infra and product engineers, so product teams ship fast without re-learning infra each time.' --- # Platform Engineering + Internal Developer Platform Pattern How to build the layer between cloud / infra and product engineers, so product teams ship fast without re-learning infra each time. 
## TL;DR (human) Platform engineering productises infrastructure for internal developers. The deliverable is an Internal Developer Platform (IDP) — a curated set of golden paths (templates, paved roads, self-service tools) that let teams ship without becoming infra experts. Tracked metrics: lead time, deploy frequency, change failure rate, time to recover (DORA). The platform team's customer is the product team. ## For agents ### What an IDP includes | Surface | What it provides | |---|---| | **Service templates** | `new-service` scaffolds: code, CI, deploy, observability wired | | **Self-service deploy** | "I want to ship this" → one command / PR | | **Self-service env** | "I need staging for my PR" → ephemeral env automatically | | **Service catalog** | "Where does X live?" → searchable inventory (Backstage et al) | | **Observability defaults** | dashboards, alerts, SLOs scaffolded per service | | **Secrets self-service** | "I need a new vault entry" → request + audit | | **Documentation hub** | API specs, ADRs, runbooks indexed | | **Cost visibility** | per-team / per-service spend dashboards | The platform is a **product**. Product engineers are customers; survey them. ### Golden paths A golden path is the easy, paved way to do a common thing — vs the unpaved way, which is allowed but offers no support. Examples: - New microservice: scaffold → live in 30 minutes. - Add a new background job: existing job-runner abstraction. - Add a new RPC method: existing dispatcher + schema package. - Add a new database: managed RDS via IaC. Going off-road is allowed. Going off-road for everything = the platform isn't paving real needs. ### Self-service ≠ unsupervised Self-service means a product engineer can do it without filing a ticket. It doesn't mean unreviewed: - Pre-flight checks (cost, security review, capacity). - Audit trail of who provisioned what. - Default-secure choices baked in. - Auto-rollback on health-check failure. 
Self-service + guardrails = velocity + safety. Self-service without guardrails = chaos. Guardrails without self-service = bottleneck.

### DORA metrics

Per-team or per-service measurement (DevOps Research and Assessment):

| Metric | Definition | Elite team target |
|---|---|---|
| **Deploy frequency** | How often code reaches prod | Multiple per day |
| **Lead time for changes** | Commit → production | < 1 day |
| **Change failure rate** | % of deploys causing incident | 0-15% |
| **Time to restore** | Incident → resolved | < 1 hour |

These metrics drive platform investment. The platform team's goal is to move every product team toward elite.

### Service catalog

Per service:

- Name + description.
- Owning team.
- On-call info.
- Source repo.
- Documentation links (ADRs, RFCs, runbooks).
- Dependencies (which services this depends on; which depend on this).
- Tier (critical, important, supporting).
- Compliance tags.

Tool: Backstage, Cortex, OpsLevel, in-house. The catalog answers "who do I ask?" + "what depends on this?" + "is this safe to change?".

### Templates + scaffolds

A `create-<name>` CLI / template:

```bash
$ npx create-service my-new-service --type=node-ts-api
# scaffolds:
# - source skeleton
# - package.json with workspace conventions
# - Dockerfile + CI workflow
# - terraform module
# - observability defaults
# - README + ADR template
```

Templates encode conventions (per [`universal.md`](./universal.md), [`ts-concrete.md`](./ts-concrete.md)) so new services start compliant. Templates evolve; old services migrate via a separate "update-service" tool.

### Ephemeral environments

Per-PR preview environments:

- Triggered automatically by PR open.
- Live URL posted to PR.
- Resources auto-torn-down on PR close (or N days idle).
- Cost-attributed to the PR author / team.

This lets reviewers + designers + product see real changes without staging churn.

### Capacity + cost guardrails

Self-service is dangerous without guardrails:

- Per-team budget caps.
- Auto-tear-down of idle resources. - Required tags (per [`../quality/cost-optimization-pattern.md`](../quality/cost-optimization-pattern.md)). - New service request → reviewed if cost > threshold. ### Documentation as a platform feature A docs portal aggregates: - Per-service READMEs (Backstage-rendered). - API references (OpenAPI / GraphQL schemas / RPC method indexes). - ADRs / RFCs. - Runbooks. - Tutorials / quickstarts. - Search. Engineers find docs in seconds, not minutes. Search quality matters more than doc volume. ### Platform team's customer Product engineers. Their satisfaction is the platform team's KPI: - Onboarding time: new engineer → first PR shipped. - Time to spin up a new service. - "How happy are you with the platform?" — quarterly survey. - Support tickets: their volume + topics. Anti-pattern: platform team optimises for its own elegance, ignores adopters' pain. ### Inner-source contributions Product teams contribute back to the platform: - A team builds a niche helper; useful broadly; promote into platform. - A team finds a bug in a template; patches it; gets credit. - Codeowners model: platform team owns merging; community drives PRs. Healthy platforms grow from the edges, not from the center alone. ### Anti-pattern: the gatekeeper platform Symptoms: - Every product change requires a platform ticket. - "Wait 2 weeks for the platform team to enable this." - Workarounds proliferate. - Product teams build shadow infrastructure. Cure: more golden paths; more self-service; fewer tickets. ### Common failure modes - **No platform team; everyone reinvents**. Drift; bus factor low. → Form a platform team when the org passes ~30 engineers. - **Platform team builds without users**. Adoption zero. → Adopt-first design. - **Self-service without guardrails**. Cost / security incidents. → Guardrails baked in. - **Old services don't migrate**. Platform forks; legacy slows. → Migration tooling; deprecation. - **Templates rot**. 
New service uses old template; conventions wrong. → Templates own conventions; CI verifies. - **No service catalog**. Org-wide knowledge in heads. → Backstage or equivalent. ### Tooling stack (typical) | Concern | Tool | |---|---| | Service catalog | Backstage, Cortex, OpsLevel, Compass | | IaC | Terraform, Pulumi, CDK, Crossplane | | Templates | cookiecutter, plop, Backstage Software Templates | | Ephemeral envs | Vercel preview deploys, Render, Fly.io, custom on k8s | | Self-service portals | Backstage, in-house | | DORA metrics | Faros, LinearB, in-house from CI + deploy logs | ### Adoption path 1. **< 30 engineers**: no platform team needed; shared playbook. 2. **30-100**: first platform engineer; service catalog; templates. 3. **100-300**: platform team of 3-8; self-service deploy; DORA tracking. 4. **300+**: full IDP; multiple platform sub-teams (compute, data, security, observability). Don't form a platform team too early; you have nothing to platform yet. ### See also - [`anti-overengineering.md`](./anti-overengineering.md) — premature platform = canonical overengineering. - [`../quality/ci-cd-pipeline-pattern.md`](../quality/ci-cd-pipeline-pattern.md) — platform owns the pipeline. - [`../quality/cost-optimization-pattern.md`](../quality/cost-optimization-pattern.md) — platform owns cost attribution. - [`../governance/universal.md`](../governance/universal.md) — platform changes go through ADR / RFC. ==== https://playbook.agentskit.io/docs/pillars/architecture/rfc-pattern --- title: 'RFC Pattern' description: 'When the decision is bigger than your team. Used for changes to **public contracts**, wire formats, plugin protocols, and any breaking change a consumer outside this repo would notice.' --- # RFC Pattern When the decision is bigger than your team. Used for changes to **public contracts**, wire formats, plugin protocols, and any breaking change a consumer outside this repo would notice. 
## TL;DR (human)

An RFC is an ADR with a **review window** and a **migration plan**. You publish the proposal, collect feedback for N days, decide, and on acceptance promote it to an ADR. The RFC is the *negotiation*; the ADR is the *contract*.

## For agents

### When to write an RFC (instead of an ADR)

Write an RFC when **any** of the following is true:

- The change breaks an existing public method signature, schema, wire format, or stable error code.
- The change adds a new top-level config field consumers must set.
- The change adds a new package that other repos / plugins are expected to depend on.
- The change modifies the plugin / extension protocol.
- The change requires a versioned migration in consumer code.

Internal-only refactors → ADR is enough. Anything a consumer can observe → RFC.

### Sections

Use [`../../templates/RFC.template.md`](../../templates/RFC.template.md). Required:

1. **Summary**: one paragraph; what changes.
2. **Motivation**: what is the problem; what use cases motivate this.
3. **Detailed design**: the proposal in enough detail that a reviewer can spot pitfalls.
4. **Backwards compatibility**: what breaks; what migration path consumers have; deprecation window if any.
5. **Migration plan**: for the codebase itself: codemod, sweep, gates flipped on.
6. **Drawbacks**: what is worse after.
7. **Alternatives**: at least two non-trivial alternatives.
8. **Unresolved questions**: closes before acceptance.

### Lifecycle

```
Draft → Open (review window starts) → Final-Comment-Period →
    ↘ Accepted  → promoted to ADR → implementation
    ↘ Rejected  → kept on disk for record
    ↘ Withdrawn → kept on disk
```

Conventions that worked in production:

- **Review window**: 5 business days minimum for "minor" RFCs, 10 for breaking changes.
- **Final-Comment-Period**: a 48-hour signal that no further changes are expected. Started by an explicit comment on the RFC PR.
- **Acceptance** requires the maintainer of every affected package to thumbs-up.
Agents can collect the thumbs-ups; only a human can be the final acceptor on a breaking change. ### Promotion to ADR On acceptance: 1. Open a PR that promotes the RFC to a numbered ADR. 2. The ADR's "Decision" section is the RFC's "Detailed design", trimmed of negotiation. 3. The ADR's "Rollout" section is the RFC's "Migration plan". 4. Link both ways: ADR cites the originating RFC; RFC's Status becomes "Accepted; promoted to ADR-NNNN". ### Gate Recommended automated checks: 1. **Index integrity** — `docs/rfc/README.md` lists every RFC with current status. 2. **Promotion linkage** — every accepted RFC has a matching ADR with the back-pointer. 3. **No orphan breakers** — no PR is allowed to change a method's params/result schema without referencing an accepted RFC (or an explicit `removes:` justification if the method is being deleted). Reference impl: [`../../scripts/check-rfc.example.mjs`](../../scripts/check-rfc.example.mjs). ### Common failure modes (sourced from production) - **Agent ships breaking change in a "small refactor" PR.** The schema is now incompatible; downstream consumers break in production. → Block any PR that mutates a method schema without an RFC reference. - **RFC merged the same day it was opened.** No time for cross-package review; surprises land a week later. → Enforce a minimum review window in the gate. - **RFC accepted but no ADR promotion.** Six months later, the RFC is buried; future agents miss it during retrieval. → Promotion is part of acceptance, not a follow-up. - **Multiple RFCs editing the same surface in parallel.** Conflicts surface only in implementation. → A "currently open RFCs touching X" map in the index lets agents notice the collision before drafting. ### See also - [`adr-pattern.md`](./adr-pattern.md) — the destination format. - [`../governance/README.md`](../governance/README.md) — merge rules for breaking changes. - [`../../templates/RFC.template.md`](../../templates/RFC.template.md) — copy-paste skeleton. 
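
A minimal sketch of the minimum-review-window check mentioned in this pattern (5 business days for minor RFCs, 10 for breaking changes). The `RfcRecord` shape and function names are assumptions made for illustration; this does not reproduce the reference gate in `check-rfc.example.mjs`.

```typescript
// Hypothetical record shape; a real gate would read this from the RFC index.
type RfcKind = "minor" | "breaking";

interface RfcRecord {
  readonly id: string;
  readonly kind: RfcKind;
  readonly openedAt: Date; // start of the review window
  readonly acceptedAt: Date; // moment the RFC was marked Accepted
}

// Minimum review windows from the conventions above, in business days.
const MIN_WINDOW_DAYS: Record<RfcKind, number> = { minor: 5, breaking: 10 };

// Counts weekdays (Mon-Fri, UTC) in the half-open range [start, end).
// Public holidays are ignored in this sketch.
function businessDaysBetween(start: Date, end: Date): number {
  const cursor = new Date(
    Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate()),
  );
  const stop = Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate());
  let days = 0;
  while (cursor.getTime() < stop) {
    const weekday = cursor.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (weekday !== 0 && weekday !== 6) days += 1;
    cursor.setUTCDate(cursor.getUTCDate() + 1);
  }
  return days;
}

// Returns null when the window was respected, otherwise a gate message.
function checkReviewWindow(rfc: RfcRecord): string | null {
  const elapsed = businessDaysBetween(rfc.openedAt, rfc.acceptedAt);
  const required = MIN_WINDOW_DAYS[rfc.kind];
  return elapsed >= required
    ? null
    : `${rfc.id}: accepted after ${elapsed} business day(s); ${rfc.kind} RFCs need ${required}`;
}
```

Counting in UTC keeps the result independent of the CI runner's time zone; a team that observes holidays would need to subtract them separately.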
==== https://playbook.agentskit.io/docs/pillars/architecture/service-mesh-pattern --- title: 'Service Mesh Pattern' description: 'How to handle cross-service communication concerns (mTLS, retries, observability, routing) without coding them in every service.' --- # Service Mesh Pattern How to handle cross-service communication concerns (mTLS, retries, observability, routing) without coding them in every service. ## TL;DR (human) A service mesh injects a sidecar proxy alongside every service. The proxy handles cross-cutting: mTLS, retries, timeouts, traffic shaping, metrics, traces. Services talk to localhost; mesh handles the rest. Powerful but operationally heavy — only adopt when service count + complexity justify. ## For agents ### What a mesh provides | Concern | Mesh-provided | |---|---| | mTLS between services | ✓ | | Retries with backoff | ✓ | | Timeouts | ✓ | | Circuit breaking | ✓ | | Load balancing | ✓ | | Traffic shaping (% canary) | ✓ | | Metrics (golden signals per service-pair) | ✓ | | Distributed traces | ✓ (auto-instrumented) | | Access control between services | ✓ | | Rate limiting (cross-service) | ✓ | Without a mesh, each service implements (badly, inconsistently) most of these. ### Cost Operational cost: - Sidecar per pod = N× memory + CPU overhead. - Control plane is itself a system to operate. - Debugging adds a layer (is it the service, the sidecar, the network?). - Upgrade cadence (mesh version drift = pain). Adopt when service count > ~20 + the cross-cutting concerns are recurring pain. Smaller fleets: pick libraries instead. ### The sidecar model ``` ┌──────────────────────┐ │ Pod │ │ ┌────────┐ ┌───────┐ │ │ │service │←│sidecar│←──── traffic │ │ app │→│ proxy │ │ │ └────────┘ └───────┘ │ └──────────────────────┘ ``` App talks to localhost. Sidecar handles outbound and inbound traffic. Control plane configures sidecars. ### mTLS Mutual TLS between services: - Every service has a cert (issued by mesh). 
- Every connection authenticates both sides. - Encryption + identity baked in. Without mTLS, internal traffic is plain. A network compromise reads it. mTLS makes lateral movement much harder. ### Traffic management Mesh enables: - **Canary deploy**: route 1% to new version; promote on metrics. - **A/B routing**: header-based; serve different versions per condition. - **Blue-green**: route 100% to new; rollback by re-routing. - **Mirror traffic**: send shadow copy to new version (no response used). - **Fault injection**: deliberately add latency / errors to test resilience. These are tools — not features the app needs to implement. ### Retry + timeout discipline Each service-to-service hop has policy: ```yaml - destination: user-service retries: 3 retry-on: 5xx,connect-error timeout: 5s per-try-timeout: 1s circuit-breaker: consecutive-errors: 5 interval: 30s ``` Policy in mesh config, not in code. Adjusted operationally without app deploy. Risk: retry storms — every layer retries; downstream amplification = upstream's incident becomes catastrophic. Configure with awareness; sometimes the right answer is *don't retry*. ### Observability gain Mesh gives golden signals per service-pair for free: - Request rate. - Error rate. - Duration (p50/p95/p99). - Plus traces auto-instrumented. Reduces per-service instrumentation burden. Cost: high-cardinality metric storage. ### Access policy Cross-service auth: - "Service A can call service B" (allow). - "Service C cannot call service B" (deny). - "Service A can call /users/* on B but not /admin/*". Policy in mesh. Service implementations don't need to verify caller — mesh does it. Useful for compliance: prove via policy that only authorised services touch sensitive data. ### Multi-cluster + multi-region Mature mesh deployments span clusters / regions: - One mesh control plane manages multi-cluster. - Service "x in cluster A" calls "x in cluster B" if local fails. - Cross-cluster traffic also mTLS. Operational complexity rises. 
Worth it for global products; overkill for single-region. ### When NOT to adopt - Service count < 10 — overhead exceeds benefit. - Single-team — no need for cross-team policy enforcement. - Team without operational expertise — mesh failures are subtle. - Latency-sensitive (sub-ms RPCs) — sidecar adds ~1ms. Libraries (resilience4j, Polly) cover most resilience needs at < 5 services. ### Lighter-weight alternatives | Need | Lighter alternative | |---|---| | mTLS only | SPIFFE / SPIRE without full mesh | | Retries + circuit-break | Application-level libraries | | Observability | OpenTelemetry SDKs | | Routing | Application-level service discovery (Consul, Eureka) | Get the value piece by piece without committing to full mesh. ### Common failure modes - **Premature adoption**. Mesh installed; team doesn't understand failure modes; outage. → Demonstrated need first. - **Sidecar OOM**. Sidecar runs out of memory; service unreachable. → Resource limits per sidecar; alerts. - **Configuration drift**. Mesh config differs from intent. → GitOps-managed. - **Retry amplification**. Mesh retries; service also retries; upstream sees N² requests. → One layer retries. - **Latency budget consumed by mesh**. Sidecar adds ms; user-facing budget breached. → Profile; tune mesh. - **Debugging "where is the failure"**. Service blames sidecar; sidecar blames service. → Tracing must traverse both. - **Upgrade fear**. Mesh version stale; security patches lag. → Operational discipline; staging-mesh first. 
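
The retry-amplification failure mode above comes down to simple arithmetic, sketched here under the assumption that every layer retries independently. The layer names and helper functions are illustrative, not part of any mesh's API.

```typescript
// One entry per layer on a call path that retries on failure,
// e.g. an application client and then the mesh sidecar.
interface RetryLayer {
  readonly name: string;
  readonly retries: number; // retries configured at this layer (not total tries)
}

// Worst case: each attempt at one layer triggers a full set of attempts at
// the next, so the totals multiply: product of (1 + retries) over all layers.
function worstCaseAttempts(layers: readonly RetryLayer[]): number {
  return layers.reduce((total, layer) => total * (1 + layer.retries), 1);
}

// Checks that all tries of one hop fit inside the overall timeout.
// Backoff delays between tries are ignored in this sketch.
function fitsTimeout(retries: number, perTryTimeoutMs: number, timeoutMs: number): boolean {
  return (1 + retries) * perTryTimeoutMs <= timeoutMs;
}
```

With 3 retries in the app and 3 in the mesh, one failing request can become 16 attempts against the destination, which is why the advice is to let a single layer own retries.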
### Tooling stack (typical) | Mesh | Notes | |---|---| | **Istio** | Most feature-rich; operational complexity | | **Linkerd** | Simpler; Rust sidecar; smaller surface | | **Cilium Service Mesh** | eBPF-based; sidecar-less option | | **Consul Connect** | HashiCorp's; integrates with their ecosystem | | **AWS App Mesh** | Cloud-managed; AWS-specific | | **Open Service Mesh** | CNCF; less mature, less active | For new adopters: Linkerd if minimal; Istio if mature ops team; Cilium for k8s-native. ### Adoption path 1. **< 10 services**: libraries. 2. **10-30 services + cross-team needs**: introduce mesh in one part of the system; learn ops. 3. **30+ services**: full mesh adoption; multi-cluster considered. 4. **Multi-region**: mesh spans regions. ### See also - [`api-gateway-pattern.md`](./api-gateway-pattern.md) — gateway for external; mesh for internal. - [`multi-region-pattern.md`](./multi-region-pattern.md) — mesh in multi-region. - [`anti-overengineering.md`](./anti-overengineering.md) — mesh is the canonical premature complexity. - [`../quality/observability-pattern.md`](../quality/observability-pattern.md) — mesh contributes signals. - [`../security/rbac-pattern.md`](../security/rbac-pattern.md) — service-to-service auth interplay. ==== https://playbook.agentskit.io/docs/pillars/architecture/ts-concrete --- title: 'Architecture — TS / Node ≥22 / pnpm Monorepo (Concrete)' description: 'Copy-paste-ready recipes that implement [`universal.md`](./universal.md) on a TypeScript stack. Calibrated on a real multi-package, multi-app monorepo built primarily by AI agents over ~1 year.' --- # Architecture — TS / Node ≥22 / pnpm Monorepo (Concrete) Copy-paste-ready recipes that implement [`universal.md`](./universal.md) on a TypeScript stack. Calibrated on a real multi-package, multi-app monorepo built primarily by AI agents over ~1 year. ## TL;DR (human) - pnpm workspaces + Turbo for the monorepo wiring. 
- One `core` package that owns Zod schemas, the error class hierarchy, and the event bus. Hard gzipped budget (calibrate per project; ~25 KB works for a 30-package monorepo).
- Strict TypeScript everywhere: `"strict": true`, `noUncheckedIndexedAccess: true`, no `any`, named exports only.
- Zod parses every HTTP / JSON-RPC / IPC / file-IO boundary.
- `AppError` subclasses with `NAMESPACE_REASON` codes; errors are thrown only as typed classes (`throw new AppError(...)`-style); raw `new Error` is lint-banned at boundary files.
- Sub-path package layout (RFC-driven) so each package can ship multiple entry points without circular imports.

## For agents

### Topology

```
repo/
├─ packages/
│  ├─ core/        # Zod schemas, errors, event bus. <25 KB gz. No internal deps.
│  ├─ contracts/   # JSON-RPC method registry + dispatcher. Depends on: core.
│  ├─ log/         # createLogger(tag), transports. Depends on: core.
│  ├─ storage/     # Persistence stores. Depends on: core, log.
│  ├─ runtime/     # Execution layer. Depends on: core, contracts, storage, log.
│  ├─ ui/          # Shared UI primitives. Depends on: core (types only).
│  └─ /            # One per cohesive feature surface.
├─ apps/
│  ├─ desktop/     # End-user app. Consumes packages.
│  ├─ web/         # Marketing + docs.
│  └─ cloud/       # Control plane.
├─ docs/
│  ├─ adr/         # Decisions (accepted = source of truth).
│  ├─ rfc/         # In-flight.
│  └─ for-agents/  # RAG-indexed per-package + per-screen + per-flow refs.
├─ AGENTS.md       # Top-level routing table (which package owns what).
├─ CLAUDE.md       # Non-negotiables mirror for AI agents.
└─ pnpm-workspace.yaml
```

### Workspace files

`pnpm-workspace.yaml`:

```yaml
packages:
  - "packages/*"
  - "apps/*"
```

`turbo.json` (Turborepo): cache `build`, `test`, `lint`, `typecheck`. Make `check:all` depend on each.
`tsconfig.base.json`: ```json { "compilerOptions": { "target": "ES2022", "module": "ESNext", "moduleResolution": "Bundler", "strict": true, "noUncheckedIndexedAccess": true, "noFallthroughCasesInSwitch": true, "noImplicitOverride": true, "exactOptionalPropertyTypes": true, "isolatedModules": true, "resolveJsonModule": true, "skipLibCheck": true, "esModuleInterop": false } } ``` Per-package `tsconfig.json` extends this and adds `references` for incremental builds. ### `package.json` rules - `"type": "module"` everywhere. - `"exports"` map with explicit sub-paths. No barrel-only packages. - `"sideEffects": false` unless you actually rely on import side effects. - `peerDependencies` for cross-cutting concerns (e.g. `zod`, `react`) so consumers pin one copy. Example: ```json { "name": "@app/core", "type": "module", "exports": { ".": "./dist/index.js", "./errors": "./dist/errors/index.js", "./schemas": "./dist/schemas/index.js", "./events": "./dist/events/index.js" }, "sideEffects": false, "peerDependencies": { "zod": "^3" } } ``` ### Sub-path layout (post-monolith-barrel) When a package grows past ~5 cohesive concerns, split its public surface into sub-paths: ``` packages/core/ ├─ src/ │ ├─ errors/ # Error classes + code constants. Exported via "./errors". │ ├─ schemas/ # Zod schemas. Exported via "./schemas". │ ├─ events/ # Event bus types. Exported via "./events". │ └─ index.ts # Re-exports the public surface from each subdir. └─ package.json # exports map per subdir. ``` Why: consumers import only what they need; tree-shaking works even without `sideEffects:false`; agents can reason about a sub-path without loading the whole package. ### Named exports only `.eslintrc.cjs`: ```js module.exports = { rules: { "import/no-default-export": "error", }, overrides: [ { // Next.js App Router + config files require default exports. 
files: [ "apps/web/app/**/{page,layout,loading,error,not-found,template}.tsx", "apps/web/app/**/route.ts", "**/{tailwind,next,vitest,vite,playwright}.config.*", ], rules: { "import/no-default-export": "off" }, }, ], }; ``` ### No `any` enforcement `.eslintrc.cjs` (additive): ```js "@typescript-eslint/no-explicit-any": "error", "@typescript-eslint/no-unsafe-assignment": "error", "@typescript-eslint/no-unsafe-call": "error", "@typescript-eslint/no-unsafe-member-access": "error", "@typescript-eslint/no-unsafe-return": "error", ``` Escape hatch: `// allow-any: \` line comment. Lint allows it; a separate gate counts these and fails if the count grows. See [`../../scripts/`](../../scripts/). ### Zod at every boundary ```ts // packages/contracts/src/methods/example.ts import { z } from "zod"; export const ExampleParams = z.object({ id: z.string().min(1), limit: z.number().int().positive().max(100).default(20), }); export const ExampleResult = z.object({ rows: z.array(z.object({ id: z.string(), name: z.string() })), }); export const exampleMethod = { method: "example.list", params: ExampleParams, result: ExampleResult, requireAuth: true, } as const; ``` Handler: ```ts import { AppError } from "@app/core/errors"; import { ExampleParams, ExampleResult } from "@app/contracts/methods/example"; export async function exampleHandler(rawParams: unknown) { const params = ExampleParams.parse(rawParams); // throws ZodError on bad input // ... business logic return ExampleResult.parse(result); // confirms our output matches the contract } ``` Dispatcher converts `ZodError` to `AppError({ code: "VALIDATION_ERROR", ... })`. 
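
A rough sketch of the dispatcher-side conversion described above. To stay dependency-free it detects validation failures structurally rather than with `instanceof ZodError`; that check, the local `AppError` stand-in, and the `toAppError` / `dispatch` names are assumptions for illustration, not the playbook's actual dispatcher.

```typescript
// Minimal local stand-in for the AppError class from @app/core/errors,
// so the sketch runs on its own.
class AppError extends Error {
  constructor(
    readonly code: string,
    message: string,
    readonly opts: { readonly cause?: unknown } = {},
  ) {
    super(message);
    this.name = "AppError";
  }
}

// Assumption: validation failures carry name "ZodError" and an `issues`
// array, as zod v3 errors do. A real dispatcher could use `instanceof`.
function isZodLikeError(err: unknown): err is { name: string; issues: readonly unknown[] } {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "ZodError" &&
    Array.isArray((err as { issues?: unknown }).issues)
  );
}

// Maps anything a handler can throw onto a typed error with a stable code.
function toAppError(err: unknown): AppError {
  if (err instanceof AppError) return err; // already typed: pass through
  if (isZodLikeError(err)) {
    return new AppError("VALIDATION_ERROR", `Invalid input (${err.issues.length} issue(s))`, { cause: err });
  }
  return new AppError("HANDLER_THREW", "Handler failed", { cause: err });
}

// Wraps a handler call so callers only ever see AppError instances.
async function dispatch(
  handler: (rawParams: unknown) => Promise<unknown>,
  rawParams: unknown,
): Promise<unknown> {
  try {
    return await handler(rawParams);
  } catch (err) {
    throw toAppError(err);
  }
}
```

Keeping the mapping in one function means handlers can call `.parse()` freely and never construct validation errors themselves.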
### Errors

```ts
// packages/core/src/errors/app-error.ts
export class AppError extends Error {
  constructor(
    readonly code: string,
    message: string,
    readonly opts: {
      readonly hint?: string;
      readonly docsUrl?: string;
      readonly cause?: unknown;
    } = {},
  ) {
    super(message, { cause: opts.cause });
    this.name = "AppError";
  }
}

// Subclasses by namespace:
export class AuthError extends AppError {}
export class ValidationError extends AppError {}
export class NotFoundError extends AppError {}
// ... etc.
```

Codes live in one file:

```ts
// packages/core/src/errors/codes.ts
export const ERROR_CODES = {
  AUTH_REQUIRED: "AUTH_REQUIRED",
  AUTH_FORBIDDEN: "AUTH_FORBIDDEN",
  VALIDATION_ERROR: "VALIDATION_ERROR",
  NOT_FOUND: "NOT_FOUND",
  HANDLER_THREW: "HANDLER_THREW",
  // ...
} as const;

export type ErrorCode = (typeof ERROR_CODES)[keyof typeof ERROR_CODES];
```

Lint rule (custom or `no-restricted-syntax`) bans `throw new Error(` in `**/methods/**` and `**/handlers/**` directories. Escape hatch: typed subclass.

### Logger

```ts
// packages/log/src/index.ts
// `write` is the package-internal function that hands a record to the
// configured transports (not shown here).
export function createLogger(tag: string) {
  return {
    info: (msg: string, fields?: Record<string, unknown>) => write("info", tag, msg, fields),
    warn: (msg: string, fields?: Record<string, unknown>) => write("warn", tag, msg, fields),
    error: (msg: string, fields?: Record<string, unknown>) => write("error", tag, msg, fields),
    debug: (msg: string, fields?: Record<string, unknown>) => write("debug", tag, msg, fields),
  };
}
```

Lint bans `console.log` / `console.warn` / `console.error` repo-wide except in `scripts/` (build-time tooling) and tests.

### Size budgets (gate)

Reference impl in [`../../scripts/check-file-size.example.mjs`](../../scripts/check-file-size.example.mjs).

Mode: **shrink-only baseline**. A JSON baseline lists every file currently over budget. New files must be under budget; baselined files must not grow.

### Hard size gate on `core`

```bash
pnpm --filter @app/core build
gzip -c packages/core/dist/index.js | wc -c  # fail if > 25600
```

Wire into `check:all`.
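
The shrink-only baseline mode from "Size budgets (gate)" could be sketched roughly as follows. The `FileSize` shape and the in-memory baseline map are assumptions for illustration; the reference script linked above may be structured differently.

```typescript
// Hypothetical input: one entry per source file with its current line count.
interface FileSize {
  readonly path: string;
  readonly lines: number;
}

// `baseline` maps a grandfathered file path to the line count recorded when
// it was baselined. Files absent from the map must be under budget.
function checkFileSizes(
  files: readonly FileSize[],
  budget: number,
  baseline: ReadonlyMap<string, number>,
): string[] {
  const violations: string[] = [];
  for (const file of files) {
    if (file.lines <= budget) continue; // under budget: always fine
    const allowed = baseline.get(file.path);
    if (allowed === undefined) {
      violations.push(`${file.path}: ${file.lines} lines exceeds budget ${budget} (not in baseline)`);
    } else if (file.lines > allowed) {
      violations.push(`${file.path}: grew from ${allowed} to ${file.lines} lines`);
    }
  }
  return violations;
}
```

A gate script would exit non-zero when the returned list is non-empty, and would rewrite the baseline downward whenever a grandfathered file shrinks.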
## Checklist when standing up a new package 1. Add to `pnpm-workspace.yaml`. 2. `package.json` with `type: module`, `exports` map, `sideEffects: false`. 3. `tsconfig.json` extends base, adds `references` to deps. 4. `src/index.ts` re-exports the public surface only. 5. `src/__tests__/` next to source, not in a top-level `test/` dir. 6. Add the package to the `AGENTS.md` routing table. 7. Add a one-pager doc in `docs/for-agents/packages/\.md` (template in [`../../templates/`](../../templates/)). 8. If the package owns persistence, register its schema with the storage layer + the contract registry. ## See also - [`contracts-zod-pattern.md`](./contracts-zod-pattern.md) — JSON-RPC + Zod registry deep dive. - [`error-hierarchy.md`](./error-hierarchy.md) — full error model + serializer. - [`file-size-budget.md`](./file-size-budget.md) — baseline gate calibration. - [`../quality/README.md`](../quality/README.md) — wiring the gates into CI. ==== https://playbook.agentskit.io/docs/pillars/architecture/universal --- title: 'Architecture — Universal Principles' description: 'Stack-agnostic. Applies to any language, any framework.' --- # Architecture — Universal Principles Stack-agnostic. Applies to any language, any framework. ## TL;DR (human) Six rules. They scale from a one-package project to a 30+ package monorepo. They are the price of admission for letting agents touch your code without supervision. 1. Name every boundary. If you cannot tell an agent "this code goes in package X", the boundary does not exist yet. 2. One contract package owns schemas + error model. Everything else depends on it; it depends on nothing internal. 3. Every change to that contract goes through a written decision (ADR / RFC). The document is the artifact. 4. No `any` / `dyn` / `interface{}` at any external boundary. Parse with a runtime schema. 5. Errors are typed with stable codes. The client (or another agent) can pattern-match without reading strings. 6. 
Files have size budgets enforced by a gate. Reviewability is a feature. ## For agents ### Rule 1 — Name every boundary Every directory that an agent might write code into must answer: "what is this for, and what is it not for?" in one sentence. - Maintain a top-level routing table mapping intent ("I want to change X") to location ("edit package Y"). Template: [`../../templates/AGENTS.md.template.md`](../../templates/AGENTS.md.template.md). - If two packages could plausibly own the same change, the boundary is wrong. Fix the boundary or merge the packages. - Group packages into 4–7 **logical groups** (e.g. "contracts + foundation", "runtime + flow", "security + collaboration"). Agents triage faster by group than by alphabetical name. **Failure mode prevented:** agents inventing new packages or piling code into the largest existing file because no rule said where it goes. ### Rule 2 — One contract package owns schemas + error model Pick one package (call it `core` or `contracts`). It contains: - All shared types / schemas (Zod, Pydantic, Protobuf — your choice). - The error class hierarchy + the central error-code constants. - Nothing else. No business logic. No I/O. Constraints: - This package has **no internal dependencies**. It can depend on `zod` and `std`, nothing else. - It has a hard size budget (e.g. 25 KB gzipped). The budget is a CI gate. Hitting the budget forces a real conversation about what belongs in the contract layer. **Failure mode prevented:** circular dependencies, schema drift between packages, agents copy-pasting "the same" schema with a one-field difference. ### Rule 3 — Decisions are written down before they ship Two artifacts: - **ADR** (Architecture Decision Record) — for choices that affect the codebase's shape. Numbered. Append-only. Status: Proposed / Accepted / Superseded / Tombstoned. Template: [`../../templates/ADR.template.md`](../../templates/ADR.template.md). 
- **RFC** (Request for Comment) — for choices that affect external contracts (public API, wire format, plugin protocol). Has a review window. Promotes to ADR when accepted. Template: [`../../templates/RFC.template.md`](../../templates/RFC.template.md). Rules: - An agent proposing a structural change without an ADR is proposing tech debt. Reject the PR; ask for the ADR first. - An agent proposing a breaking change without an RFC is proposing an unannounced break. Reject the PR; ask for the RFC. - ADRs/RFCs are the **change**. The code is the implementation of the change. **Failure mode prevented:** "we decided" without anyone able to find the decision; future agents reverting it because they cannot see why it was made. ### Rule 4 — Parse, don't validate, at every external boundary External boundary = anywhere bytes enter the process from outside the trust boundary: HTTP, IPC, JSON-RPC, file I/O, env vars, CLI args, message-bus payloads. - Define a schema. Parse the input. Use the parsed type. If parsing fails, raise a typed error with a stable code. - Do not type-cast unparsed input. Do not "trust the API contract on the other side". Parse. - The parsed type is the only type that flows into the rest of the system. Untyped data is sandboxed at the edge. **Failure mode prevented:** runtime errors deep in the system caused by an upstream caller's drift; agents writing handler code that assumes the wrong shape and breaks silently. ### Rule 5 — Typed errors with stable codes Define one base error class. Every other error in the system subclasses it. Each error has: - a stable string code (`NAMESPACE_REASON`, all-caps, snake-case), - a human message (intl-keyed in user-facing surfaces), - an optional `hint` (one-line suggestion to the caller), - an optional `docsUrl` (link to the error doc). Constraints: - Never `throw new Error('...')` at a boundary. Wrap in a typed subclass. 
- The dispatcher / HTTP layer serializes typed errors with `code` + `message` + `hint` + `docsUrl`. Unknown thrown errors become an opaque `INTERNAL_ERROR` — never leak stack traces or raw strings. - Codes are append-only. Renames go through an ADR. **Failure mode prevented:** clients pattern-matching on error message strings (which drift); agents inventing new error shapes per package; opaque failures in production. ### Rule 6 — File-size budgets Pick budgets per file kind. Enforce them in a gate. Example budgets (calibrated for TS/React; adjust per language): - View / component files: 300 lines. - Logic / module files: 500 lines. - Test files: 800 lines (often unavoidable for table-driven tests). Rules: - The gate is **baseline-shrink-only**: an existing file over budget is grandfathered, but new code in that file must shrink it; new files must respect the budget. - Hitting the budget = extract. Do not lower the budget to fit. Do not split into `\-2.\`. **Failure mode prevented:** files becoming unreviewable; agents losing context inside a 1500-line component; reviewers approving PRs they cannot read. ## See also - [`adr-pattern.md`](./adr-pattern.md) — how to write an ADR. - [`rfc-pattern.md`](./rfc-pattern.md) — when an ADR is not enough. - [`error-hierarchy.md`](./error-hierarchy.md) — the error model in detail. - [`file-size-budget.md`](./file-size-budget.md) — budget calibration + gate impl. - [`../governance/README.md`](../governance/README.md) — merge rules that protect these boundaries. ==== https://playbook.agentskit.io/docs/pillars/governance --- title: 'Pillar — Governance' description: 'How multiple agents (and humans) coordinate work in one repo without subtracting each other''s progress.' --- # Pillar — Governance How multiple agents (and humans) coordinate work in one repo without subtracting each other's progress. ## Status ◐ Scoped, not yet detailed. 
## Scope

| Concern | Universal principle | Concrete pattern |
|---|---|---|
| PR intent manifest | Every PR declares what it adds / removes / changes; reviewers verify against it | `pr-intent.yaml` parsed by a gate; renames + removes require explicit `removes:` entries |
| Merge rules | Merges sum work, never subtract | Agents may not `git checkout --theirs/--ours` without `merge-override: \` annotation |
| Concurrent-agent awareness | Other agents may be editing the same files | Session start: `git fetch`, recheck issue state, look for in-flight PRs touching the same paths |
| One sub-unit per session | Big phases split into discrete, shippable sub-units | Sub-unit defined before starting; no scope creep mid-session |
| Phased PR + admin merge | Long initiatives ship as a chain of phase PRs | `gh pr merge --merge --admin`, delete branch, continue off fresh main |
| Removes-list | Listing removed exports forces intentionality | Gate fails if a PR removes an exported symbol without a `removes:` entry |
| Tombstones | Retire docs, plans, ADRs without losing trail | Prepend a 🪦 status block; keep the body |
| Audit trail | Every privileged operation produces a signed ledger entry | See security pillar |
| Verify-first close | Before fixing an issue, verify it's still open and not solved concurrently | `gh issue view \ --json state` at session start and again before push |

## Non-negotiables

1. **No silent deletions.** Removing another author's exported symbol requires a `removes:` entry + justification in PR intent.
2. **Decisions are written, not announced.** ADR or RFC; see architecture pillar.
3. **Tombstone, do not delete.** Trail beats clean.
4. **Verify before fixing.** Concurrent agents may have closed the issue already.
5. **One PR = one sub-unit.** No "while I'm here" expansions.
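
The "no silent deletions" rule can be illustrated with a small sketch. It assumes the gate already has the exported symbol names before and after the PR and a parsed manifest whose `removes` field is a list of names; how those are extracted is out of scope here, and the type and function names are hypothetical.

```typescript
// Hypothetical parsed form of the PR-intent manifest; only `removes` is used.
interface PrIntent {
  readonly removes: readonly string[];
}

// Returns exported symbols that disappeared in the PR without a matching
// `removes:` entry. An empty result means the gate passes.
function findSilentDeletions(
  exportsBefore: readonly string[],
  exportsAfter: readonly string[],
  intent: PrIntent,
): string[] {
  const stillExported = new Set(exportsAfter);
  const declared = new Set(intent.removes);
  return exportsBefore.filter((name) => !stillExported.has(name) && !declared.has(name));
}
```

A rename shows up here as one removed name plus one added name, which is why renames also need an explicit `removes:` entry.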
## See also - [`../architecture/adr-pattern.md`](../architecture/adr-pattern.md), [`../architecture/rfc-pattern.md`](../architecture/rfc-pattern.md) - [`../ai-collaboration/README.md`](../ai-collaboration/README.md) — agent-side discipline for these rules. - [`../../templates/PR-intent.template.md`](../../templates/PR-intent.template.md) — the manifest skeleton. ## Roadmap - `universal.md` - `pr-intent-pattern.md` - `merge-rules-pattern.md` - `tombstone-pattern.md` - `phased-pr-pattern.md` ==== https://playbook.agentskit.io/docs/pillars/governance/merge-rules-pattern --- title: 'Merge Rules Pattern' description: 'How to resolve conflicts so the merge sums work instead of subtracting it.' --- # Merge Rules Pattern How to resolve conflicts so the merge sums work instead of subtracting it. ## TL;DR (human) Default: rebase, resolve hunk-by-hunk, keep both sides where they coexist. `git checkout --theirs/--ours` is almost never the right answer; when it is, document why with `merge-override:` in the PR-intent manifest. After any conflict resolution, re-run the affected tests — diffs that look clean can hide reordering bugs. ## For agents ### The hazards Conflicts happen where your work meets peer work. Two ways they go wrong: 1. **Silent subtraction.** You drop one side's change without noticing. 2. **Reordering bug.** The diff looks plausible; the runtime behavior is wrong because two changes had a dependency you missed. The merge-rules pattern targets both. ### Default protocol 1. **Rebase, do not merge.** `git rebase origin/main`, not `git merge main`. Linear history is easier to read; reviewers can see exactly what diverged. 2. **Resolve hunk-by-hunk.** For each conflict, read both sides. Decide: - Both contributions are needed → merge by hand, keeping both. - One is strictly newer / better → keep that one; the other was a partial step. - They are incompatible → stop and ask. Do not pick blindly. 3. 
**Run tests on every step of a multi-commit rebase.** Not just at the end. A reordered commit can pass at the tip and fail in the middle, which breaks `git bisect` later. 4. **Open the diff against `origin/main` and re-read.** Confirm the diff is what you expect. ### `--theirs` / `--ours` policy Almost never the right answer. Their semantics: - `--ours` keeps the side currently checked out. During a rebase, that is **the upstream** (because you're replaying yours onto theirs). - `--theirs` keeps the incoming side. During a rebase, that is **your work**. The flipped semantics during rebase trip up agents repeatedly. Avoid both unless you can clearly state which side wins and why. When you do use them, the PR-intent manifest must have: ```yaml merge-override: "Explanation: which side was dropped, why dropping it was safe, what compensating change (if any) was needed." ``` The gate fails the PR if these flags appear in the diff (detectable via `git rerere` cache or by analysing the commit message) without the annotation. ### Removes-list discipline When the merge involves a delete-vs-edit conflict: - If the delete wins → the manifest needs a `removes:` entry naming the symbol or file. - If the edit wins → the manifest needs nothing extra, but the reviewer must verify the edit is still needed. Pattern that recurs in production: agent A renames a symbol; agent B edits the old name; merge result keeps the old name **and** the rename in different files. The build breaks. → After any rename conflict, search-and-verify all references are updated consistently. 
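
The search-and-verify step after a rename conflict might be automated along these lines. This is a sketch under the assumption that file contents are already loaded into a map; the function name is hypothetical, and the whole-word match can produce false positives in comments or strings.

```typescript
// Finds lines that still mention `oldName` as a whole word.
// `files` maps a file path to its full text content.
function findStaleReferences(files: ReadonlyMap<string, string>, oldName: string): string[] {
  // Escape regex metacharacters so names like `$store` match literally.
  const escaped = oldName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`\\b${escaped}\\b`);
  const hits: string[] = [];
  files.forEach((source, path) => {
    source.split("\n").forEach((line, index) => {
      // Report 1-based line numbers in "path:line" form.
      if (pattern.test(line)) hits.push(`${path}:${index + 1}`);
    });
  });
  return hits;
}
```

Running it with the pre-rename name after resolving the conflict should return an empty list; any hit is a reference that the merge left behind.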
### Conflict patterns and resolutions

| Pattern | Right move |
|---|---|
| Both sides added independent lines in the same file | Keep both, order them sensibly |
| Both sides modified the same line | Read both; pick the union if both intents matter; pick the newer one if it supersedes |
| One side deleted a file the other side edited | Read why each side did it; usually the delete wins after the edit is migrated elsewhere |
| Both sides renamed the same file to different names | Stop; this is a design conflict, not a merge conflict |
| One side reformatted; the other side changed behavior | Reformatting wins for the lines it touched; behavior wins for the lines it touched; manual merge per hunk |
| Auto-generated file conflict (lockfile, codegen) | Regenerate from scratch on the resolved state; do not hand-merge |

### After the resolution

1. **Tests.** Run them. All of them in scope. Conflicts can pass the build and fail behavior tests.
2. **Lint + structural gates.** Run them. Conflicts can break invariants without producing syntax errors.
3. **`git diff origin/main`.** Read it. Confirm it matches your manifest claims.
4. **PR description.** Update if scope changed. Mention the conflict resolution in a comment if it was non-trivial.

### Common failure modes

- **Used `--theirs` without thinking about what "theirs" means during rebase.** Dropped your own work. → Slow down; rebase semantics are inverted from merge semantics.
- **Hand-merged a lockfile.** Now `pnpm install` fails for everyone. → Regenerate; never hand-merge generated files.
- **Resolved by accepting peer's whole file.** Discarded valid local changes. → Accept-by-file is an emergency tool, not a strategy.
- **Pushed without re-running tests.** CI red; reviewer's time burned. → Always re-run after conflict resolution.
- **No `merge-override:` annotation when one was needed.** PR-intent gate fails; restart resolution. → Annotate at the moment you use the flag; do not "add it later".
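
A sketch of the annotation check behind the last failure mode. It assumes the gate is handed a list of how each conflicted path was resolved; how that list is detected is left open, and the `Resolution` and `PrIntentManifest` shapes are hypothetical.

```typescript
// How a conflicted path was resolved: by hand, or by taking one whole side.
type Strategy = "manual" | "ours" | "theirs";

interface Resolution {
  readonly path: string;
  readonly strategy: Strategy;
}

// Hypothetical parsed manifest; `mergeOverride` mirrors the `merge-override:` field.
interface PrIntentManifest {
  readonly mergeOverride?: string;
}

// Returns one message per whole-side resolution that lacks an annotation.
function checkMergeOverride(
  resolutions: readonly Resolution[],
  manifest: PrIntentManifest,
): string[] {
  const wholeSide = resolutions.filter((r) => r.strategy !== "manual");
  if (wholeSide.length === 0) return []; // nothing to justify
  const note = (manifest.mergeOverride ?? "").trim();
  if (note.length > 0) return []; // annotation present
  return wholeSide.map(
    (r) => `${r.path}: resolved with --${r.strategy} but the manifest has no merge-override annotation`,
  );
}
```

The check only confirms that an explanation exists; judging whether the explanation is adequate stays with the reviewer.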
### See also - [`universal.md`](./universal.md) — Rule 2. - [`pr-intent-pattern.md`](./pr-intent-pattern.md) — `merge-override:` field. - [`../ai-collaboration/concurrent-agent-pattern.md`](../ai-collaboration/concurrent-agent-pattern.md) — defensive checklists. ==== https://playbook.agentskit.io/docs/pillars/governance/phased-pr-pattern --- title: 'Phased PR Pattern' description: 'How to ship initiatives too big for one PR without ending up with a long-lived branch hell.' --- # Phased PR Pattern How to ship initiatives too big for one PR without ending up with a long-lived branch hell. ## TL;DR (human) Split big initiatives into phases. Each phase is independently shippable: passes gates, is reviewable end-to-end, can ship without later phases. Merge each phase before opening the next. Fork the next phase from fresh main, not from the previous phase's branch. ## For agents ### Why phase A mega-PR fails three ways: 1. **Unreviewable.** Past ~500 LOC of meaningful change, reviewers either skim or punt. 2. **Long-lived conflicts.** Every day the branch is open, main moves and conflicts compound. 3. **All-or-nothing.** If phase 3 of the plan turns out to be wrong, you cannot land phases 1 and 2 cleanly. Phasing solves all three at once. ### Splitting strategy Three good axes for splitting: 1. **By layer.** Phase 1: schemas + types. Phase 2: stores. Phase 3: handlers. Phase 4: UI. Each phase compiles on its own with stub adapters at the next layer. 2. **By surface.** One package per phase. Cross-cutting refactor → one PR per affected package. 3. **By risk.** Phase 1: low-risk foundation. Phase N: the controversial change. Pick the axis that minimises cross-phase coupling. If phases need each other to compile, you split wrong. ### Per-phase rules Each phase: - Is one PR. - Has its own PR-intent manifest (sub-unit references the same parent issue with `· phase N` suffix). - Passes gates independently. The repo is shippable after each merge. 
- Forks from **current main** at PR-open time, not from the previous phase's branch. - Lands behind a feature flag if the partial state is not yet user-facing. ### Tracker issue The parent issue lists all phases with status: ```markdown - [x] Phase 1 — schemas + types — PR #1234 - [x] Phase 2 — stores — PR #1245 - [ ] Phase 3 — handlers - [ ] Phase 4 — UI - [ ] Phase 5 — feature-flag flip ``` Update the parent issue at the start and end of each phase. This is the canonical "where are we" doc. ### Merge cadence Default: merge each phase with `gh pr merge --merge --admin` (or your equivalent) **after gates pass on the rebased branch**. Then delete the branch. Then fork the next phase off fresh main. Why `--admin`: phased work often has tight dependencies between phases; you do not want the next phase blocked on a slow reviewer pinging an approval into a now-stale branch. Admin merge is appropriate **when gates are green** — never as an override of failing gates. Why fresh main each phase: the previous phase's branch carries baggage (its commits, its conflict-resolution state). Forking from main resets the slate, avoiding cumulative drift. ### Feature flags If a partial-state phase is observable to users, gate the new behavior behind a flag, default off. Phases 1–N add the behavior; the final phase flips the default. Two benefits: - Each phase is shippable to production without exposing half-built UX. - The flag flip is itself a trivial PR, reversible if something goes wrong. ### Cross-phase dependencies Sometimes phase N+1 needs a change to a contract introduced in phase N. Handle by: 1. Land phase N. The new contract is available. 2. Open phase N+1 against current main (which now has the new contract). Never open phase N+1 while phase N is still in review. The merge order matters. ### When phasing goes wrong - **Phases are too small.** A 30-line PR per phase + 20 phases = review overhead dominates the work. → Merge adjacent phases when they share a reviewer. 
- **Phases are too big.** Each phase is 1500 LOC. → Re-split. The axis was wrong. - **Phase N+1 cannot ship without phase N+2.** → Phases are coupled; you did not split well. Re-plan. - **Branch sits open for a week between phases.** → Cadence too slow; conflicts compound. Aim for one phase per session. ### Long-lived parent branches Some teams use a long-lived `epic/\` branch with phase PRs merging into it, then a final mega-merge to main. **Do not do this.** Reasons: - The mega-merge is unreviewable again. - The epic branch conflicts with main as main moves. - You lose the gate-per-phase discipline. Phases merge directly to main. Each phase carries its own gate signal. ### Common failure modes - **Phase 2 opened while phase 1 still under review.** Conflict storm when phase 1 lands. → Strict serial: merge, then open next. - **No feature flag on a half-built UI surface.** Users see broken state in prod. → Flag mid-build; flip in the final phase. - **Parent issue not updated.** Reviewers can't tell which phase is current. → Update parent issue at every phase start + end. - **Phases not independently testable.** Phase 1 has no behavior without phase 3; can't write a meaningful test. → Phase contract is "the system still compiles + existing tests still pass". Adding tests for the new behavior happens in the phase that adds the behavior. ### See also - [`universal.md`](./universal.md) — Rule 8. - [`../ai-collaboration/universal.md`](../ai-collaboration/universal.md) — Rule 5 (one sub-unit per session). - [`merge-rules-pattern.md`](./merge-rules-pattern.md) — for the rebase-each-phase step. ==== https://playbook.agentskit.io/docs/pillars/governance/pr-intent-pattern --- title: 'PR Intent Pattern' description: 'The manifest that makes a PR''s claims verifiable.' --- # PR Intent Pattern The manifest that makes a PR's claims verifiable. ## TL;DR (human) Every PR carries a structured manifest. 
The manifest says what the PR adds, changes, removes, tests, documents, and which gates it expects green. A gate parses the manifest and cross-checks the diff. The manifest IS the contract — the diff is the implementation. ## For agents ### Why a manifest Without one: - Reviewers read the PR description, scan the diff, and *hope* the two match. - Renames look like delete-plus-add — peer work disappears silently. - "Refactor" PRs grow to include behavior changes that get reviewed as if they were cosmetic. - Quality gates that should run for a UI change skip because the PR was labeled `refactor`. The manifest forces the agent to state the claim before opening review. The gate enforces the claim. ### Where the manifest lives Two valid placements: 1. **Embedded YAML in PR description** (between `\`\`\`yaml ... \`\`\``). Simpler — no extra file. Gate parses the description body. 2. **`pr-intent.yaml` file in the diff** (typically at repo root, deleted-on-merge or `.gitignored`-on-merge). Lets the manifest survive review revisions cleanly. Pick one. Mixing is worse than either. ### Manifest schema Full template: [`../../templates/PR-intent.template.md`](../../templates/PR-intent.template.md). Required fields: ```yaml intent: summary: "One sentence imperative voice" pillar: architecture | security | ui-ux | quality | governance | ai-collaboration phase: discover | design | build | test | ship | operate sub-unit: " · " type: feat | fix | refactor | docs | test | chore | adr | rfc adds: - "" changes: - "" removes: - symbol: "" justification: "" tests: - "" docs: - "" gates: - lint - typecheck - unit - structural - "" ``` ### Gate checks Reference impl: [`../../scripts/check-pr-intent.example.mjs`](../../scripts/check-pr-intent.example.mjs). The gate enforces: 1. **Well-formed.** Parse fails → PR fails. 2. **All required fields present.** 3. **`removes:` matches diff.** Every exported-symbol deletion in the diff is listed. Every listed removal exists in the diff. 4. 
**`adds:` matches diff.** Every new exported symbol is listed. 5. **`gates:` are all green.** If `gates:` lists `structural` and the structural gate failed, the PR fails. 6. **Sub-unit references a real issue.** `gh issue view \` returns; state is `open` (or `closed` if this PR is the closer). 7. **`type:` matches change pattern.** A PR typed `docs` that modifies `src/**/*.ts` fails — type mismatch. ### Reviewer workflow The reviewer: 1. Reads the manifest. Understands the claim. 2. Reads the diff. Confirms it matches the claim. 3. Reads the tests. Confirms they cover the claim. 4. Reads the docs. Confirms they reflect the claim. If the diff does something the manifest does not claim, the diff is wrong **or** the manifest is wrong. Either way, the PR is not yet ready. ### Common failure modes - **`removes:` empty when diff deletes peer-authored symbols.** Silent revert. → Gate detects exported-symbol removals against `git blame`; requires manifest entry. - **`summary` is marketing copy.** "Improve user experience" — not actionable. → Lint summary for verbs: `add | fix | refactor | rename | remove | document | test`. - **Renames as delete + add.** Looks like a remove + add; in fact it's a rename. → Gate detects similar-content pairs and asks: rename or genuine removal? - **`gates: []` (skipped).** Agent disables gates to merge faster. → Gate config has a hard-coded minimum set that cannot be removed. - **`sub-unit` lists multiple issues.** PR is doing too much. → Gate requires exactly one entry. ### Adoption path If your repo does not have manifests yet: 1. Write the template. Land it in `templates/`. 2. Make manifests *optional* for two weeks. Agents practice; reviewers learn the shape. 3. Make manifests *required* (gate fails without one) — for new PRs only. 4. After one month, make the cross-check (manifest-vs-diff) mandatory. 5. After two months, add the `removes:` enforcement. 
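Steps 4 and 5 can start from something as small as a set comparison. A minimal sketch, assuming the manifest's `removes:` entries and the list of exported symbols deleted in the diff have already been extracted; the type and function names are hypothetical and this is not the reference implementation in `check-pr-intent.example.mjs`:

```typescript
// Illustrative sketch of the `removes:` cross-check. The input shapes are
// assumptions: a real gate would derive `deletedInDiff` from the diff itself.
interface RemovesEntry {
  symbol: string;
  justification: string;
}

interface RemovesReport {
  unlisted: string[];    // deleted in the diff but missing from the manifest
  phantom: string[];     // listed in the manifest but not deleted in the diff
  unjustified: string[]; // listed with an empty justification
}

function checkRemoves(manifest: RemovesEntry[], deletedInDiff: string[]): RemovesReport {
  const listed = new Set(manifest.map((entry) => entry.symbol));
  const deleted = new Set(deletedInDiff);
  return {
    unlisted: Array.from(deleted).filter((symbol) => !listed.has(symbol)),
    phantom: Array.from(listed).filter((symbol) => !deleted.has(symbol)),
    unjustified: manifest
      .filter((entry) => entry.justification.trim() === "")
      .map((entry) => entry.symbol),
  };
}

// The gate passes only when the manifest and the diff agree in both directions.
function removesGatePasses(report: RemovesReport): boolean {
  return (
    report.unlisted.length === 0 &&
    report.phantom.length === 0 &&
    report.unjustified.length === 0
  );
}
```

Reporting the three lists separately keeps the failure message actionable: the author sees which symbol to add to the manifest, which entry to delete, and which justification to write.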
Graduated adoption prevents the gate from becoming a blocker before the team has the muscle to honor it. ### See also - [`universal.md`](./universal.md) — Rule 1. - [`merge-rules-pattern.md`](./merge-rules-pattern.md) — `merge-override:` annotation. - [`../../templates/PR-intent.template.md`](../../templates/PR-intent.template.md). - [`../../scripts/README.md`](../../scripts/README.md) — gate reference impl. ==== https://playbook.agentskit.io/docs/pillars/governance/tombstone-pattern --- title: 'Tombstone Pattern' description: 'How to retire a doc, plan, ADR, screen, or package without losing the trail.' --- # Tombstone Pattern How to retire a doc, plan, ADR, screen, or package without losing the trail. ## TL;DR (human) Retired content keeps its file but gets a 🪦 status block prepended. Back-references are updated to mark it retired without removing the link. The historical record stays intact; the agent reading the file knows immediately it is no longer active. ## For agents ### Why not delete Deletion loses three things: 1. **The decision trail.** Future agents need to know that something *was* the answer, *why* it was, and *why it changed*. The file is the evidence. 2. **Back-references.** Other docs / ADRs / commit messages link to it. Hard links break; agents follow broken links and waste a turn. 3. **Audit history.** Compliance / governance / security reviews need to see what existed and when. Git history preserves *some* of this, but git history is hard to discover. The file at its path with a clear retirement marker is discoverable in one read. ### Tombstone block format Prepend to the top of the file, **above the original title**: ```markdown > 🪦 **TOMBSTONED YYYY-MM-DD** — . > Kept for trail; do not treat as active. # ``` Required fields: - `🪦` emoji — visual signal. - `TOMBSTONED` — the keyword. - `YYYY-MM-DD` — when it was retired. ISO date. - Reason — one line. - Pointer to the replacement (if any). 
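The required fields above are regular enough to check mechanically. A minimal sketch, assuming the block sits on the first line of the file and that the separator after the date is the long dash shown in the format (matched here by its code point, `\u2014`); `parseTombstone` and the `Tombstone` shape are hypothetical names, not an existing playbook script:

```typescript
// Hypothetical validator for the tombstone block format described above.
interface Tombstone {
  date: string;   // ISO date, YYYY-MM-DD
  reason: string; // one-line reason; may include the pointer to the replacement
}

// Matches: > 🪦 **TOMBSTONED YYYY-MM-DD** <long dash> <reason>
const TOMBSTONE_RE = /^>\s*🪦\s*\*\*TOMBSTONED (\d{4}-\d{2}-\d{2})\*\*\s*\u2014\s*(.+)$/u;

function parseTombstone(fileText: string): Tombstone | null {
  const firstLine = fileText.split("\n", 1)[0] ?? "";
  const match = TOMBSTONE_RE.exec(firstLine.trim());
  if (!match) return null;
  const [, date = "", reason = ""] = match;

  // Reject strings that look like dates but are not real calendar days.
  const parsed = new Date(`${date}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime())) return null;
  if (parsed.toISOString().slice(0, 10) !== date) return null;

  // The one-line reason is mandatory.
  if (reason.trim().length === 0) return null;
  return { date, reason: reason.trim() };
}
```

A `null` result covers both "not tombstoned" and "tombstoned but malformed"; a stricter gate might want to tell those apart.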
### When to tombstone | Situation | Action | |---|---| | A plan / initiative is complete | Tombstone with status summary | | An ADR is superseded by a new ADR | Change Status to "Superseded by ADR-NNNN"; that is the tombstone form for ADRs | | A doc describes a removed surface | Tombstone with link to whatever replaced it | | A doc is wrong and the correct content lives elsewhere | Tombstone, do not edit in place — preserves the wrong-but-historical content | ### When **not** to tombstone | Situation | Action | |---|---| | Build artefact / generated file | Just delete | | Pure typo / formatting fix | Edit in place | | Doc that was never published / never linked | Edit or delete; there is no trail to preserve | | ADR that was rejected (never accepted) | Leave Status: Rejected; do not tombstone — it has historical value as-is | ### Back-reference sweep When you tombstone, sweep references: 1. `grep -r "path/to/tombstoned.md"` — every other doc that links here. 2. Update each linking doc to either: - Update the link target to the replacement, **or** - Leave the link but add a parenthetical "(🪦 retired)". 3. Update top-level indexes (`README.md`, `docs/README.md`, `INDEX.md`) — the file may still appear, but its row says "🪦 retired". Goal: no surprise. A reader landing on a back-reference learns immediately that the target is retired. ### Tombstones are immutable Once a file is tombstoned, do not edit its body. The body is the historical record. The only allowed edit is to update the tombstone block itself (e.g. fix a typo, add a more current replacement link). If the body needs to be updated to "be correct again", you do not want a tombstone — you want an edit. ### Rolling up tombstones After a long campaign of work, many tombstoned files accumulate. They are still discoverable and that is good. But indexes can get noisy. Periodic clean-up: 1. Move tombstoned files into a sibling `_archive/` directory. 2. Update back-references to the new path. 3. 
Index file lists archived items at the bottom in a collapsed section. This is **not deletion**. It is *demotion in discoverability*, with the trail intact. ### Common failure modes - **Tombstone without back-reference sweep.** Other docs still treat the retired file as authoritative. → Always sweep. A `check:back-refs` gate helps. - **Tombstone followed by "actually let me edit this".** Confuses readers — is it retired or not? → Decide first; if you edit, it's not a tombstone. - **Delete + new file with same name.** Loses the trail entirely. → Tombstone the old, new file goes by a new name (or same name with a clear `### Replaces tombstoned \` block at top). - **Tombstone with no reason.** Reader has no idea why it was retired. → One-line reason is mandatory. ### See also - [`universal.md`](./universal.md) — Rule 4. - [`../architecture/adr-pattern.md`](./../architecture/adr-pattern.md) — Status: Superseded is the ADR-specific tombstone. ==== https://playbook.agentskit.io/docs/pillars/governance/universal --- title: 'Governance — Universal Principles' description: 'How multiple contributors — agents and humans — coordinate so the whole sums.' --- # Governance — Universal Principles How multiple contributors — agents and humans — coordinate so the whole sums. ## TL;DR (human) Eight rules. They make multi-author work additive instead of subtractive. Without them, the second agent silently undoes the first agent's work; the third reviewer cannot tell what changed; the fourth release ships a regression that "no one merged". 1. Every PR declares intent up front. 2. Merges sum work — removing peer work needs explicit justification. 3. Decisions are documented (ADR / RFC) before they ship. 4. Tombstone retired work; never silently delete. 5. One sub-unit per PR; one PR per session. 6. Verify-first close — confirm the issue is still open before "fixing" it. 7. Concurrent agents notice each other (search, fetch, check state). 8. 
Phased work ships in a chain; each phase is independently complete. ## For agents ### Rule 1 — Every PR declares intent up front Each PR has a manifest in the description (or a `pr-intent.yaml` file in the diff). The manifest lists: - `summary` — one sentence. - `adds` — new exported symbols, new files. - `changes` — existing symbols whose behavior changed. - `removes` — symbols / files deleted. - `tests` — tests added / updated. - `docs` — docs added / updated. - `gates` — gates expected to be green. A gate parses the manifest and verifies it against the diff. Mismatch fails the PR. Template: [`../../templates/PR-intent.template.md`](../../templates/PR-intent.template.md). **Failure mode prevented:** PRs whose description does not match the diff; reviewers approving claims that the diff contradicts; agents quietly expanding scope mid-session. ### Rule 2 — Merges sum work — removing peer work needs explicit justification Two specific protections: 1. **`removes:` is mandatory.** If your diff deletes an exported symbol you did not author, the manifest must include a `removes:` entry with a justification (why this is safe; what replaces it). 2. **`merge-override:` for `--theirs`/`--ours`.** If you used those flags to resolve a conflict, the manifest must include a `merge-override:` entry explaining why dropping one side was correct. The gate fails the PR if either is missing when the diff calls for it. **Failure mode prevented:** agents silently dropping peer work during conflict resolution; agents deleting "obsolete" code that turns out to be used by another package. ### Rule 3 — Decisions are documented (ADR / RFC) before they ship Architecture changes → ADR. Breaking-contract changes → RFC. The doc IS the change. The code implements the doc. A PR that ships architecture without a referenced ADR is incomplete. 
Cross-cutting reference: [`../architecture/adr-pattern.md`](../architecture/adr-pattern.md), [`../architecture/rfc-pattern.md`](../architecture/rfc-pattern.md). **Failure mode prevented:** rules that "everyone knows" but no one can cite; future agents reverting decisions because they cannot find the rationale. ### Rule 4 — Tombstone, never silently delete When a doc / plan / ADR / screen / package is retired: 1. Prepend a tombstone block: ```markdown > 🪦 **TOMBSTONED \** — superseded by [\](./...). Kept for trail. ``` 2. Keep the body. 3. Update the back-references (index pages) to mark it retired without removing the link. Why: the doc / decision / plan is part of the historical record. Future agents may need to understand why it existed and why it was retired. Deletion loses both. **Exception:** purely generated artefacts (build outputs, CI reports). Tombstone source-of-truth content, not build artefacts. **Failure mode prevented:** retired plans re-discovered six months later because no one knows they were retired; conflicting docs because the old version was deleted instead of marked. ### Rule 5 — One sub-unit per PR A sub-unit is one discrete, shippable change. - Cross-cutting refactor → split into one PR per affected package, chained. - "While I'm here" expansions → split into a follow-up PR. - A bug fix bundled with a refactor → split. The reviewer must be able to read the PR end-to-end and understand the intent in one sitting. If they cannot, the PR is too big. **Failure mode prevented:** PRs that combine unrelated changes; reviewers approving a refactor along with a bug fix without verifying both; subsequent agents reverting half the PR because they only understood the other half. ### Rule 6 — Verify-first close Before "closing" an issue: 1. `gh issue view \ --json state` — is it still open? 2. Re-read the issue's DoD. Did your work meet it? 3. Look at peer-closed PRs referencing the same issue. Did someone close it concurrently? 
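The three checks above can be folded into one pre-flight decision. A minimal sketch, assuming the issue lookup in step 1 prints JSON shaped like `{"state":"OPEN"}` (the exact casing is an assumption, so the sketch lower-cases it); `verifyFirstClose` and its field names are hypothetical:

```typescript
// Hypothetical pre-flight helper for Rule 6; not part of gh or the playbook.
// Steps 2 and 3 still need a reader's judgment, so they arrive as booleans.
interface CloseChecks {
  issueJson: string;       // raw output of the issue lookup in step 1
  dodMet: boolean;         // step 2: did the work meet the issue's DoD?
  peerPrCoversIt: boolean; // step 3: did a peer PR already address the issue?
}

interface CloseVerdict {
  ok: boolean;
  reason: string;
}

function verifyFirstClose(checks: CloseChecks): CloseVerdict {
  let state = "unknown";
  try {
    const parsed: unknown = JSON.parse(checks.issueJson);
    if (typeof parsed === "object" && parsed !== null && "state" in parsed) {
      state = String((parsed as { state: unknown }).state).toLowerCase();
    }
  } catch {
    // Unparseable output stays "unknown", which blocks the close.
  }
  if (state !== "open") return { ok: false, reason: `issue state is ${state}` };
  if (checks.peerPrCoversIt) return { ok: false, reason: "a peer PR already addresses this issue" };
  if (!checks.dodMet) return { ok: false, reason: "definition of done not met" };
  return { ok: true, reason: "issue open, DoD met, no concurrent close" };
}
```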
Verify-first close was the single highest-yield governance discipline in production multi-agent work: without it, agents repeatedly ground on already-closed issues. **Failure mode prevented:** duplicate PRs that get rejected; agents claiming "fixed" on issues whose DoD they did not actually meet. ### Rule 7 — Concurrent agents notice each other Before starting work in a path: - `gh pr list --search "is:open \"` — are peer PRs touching this? - `git log HEAD..origin/main --name-only` (after `git fetch`) — what has main changed since you forked? - Read peer PR descriptions. You may be redundant. This is **search, not coordination**. The agents do not have to talk; the repo records who is doing what. See [`../ai-collaboration/concurrent-agent-pattern.md`](../ai-collaboration/concurrent-agent-pattern.md) for the full defensive checklist. **Failure mode prevented:** two agents producing two PRs for the same fix; conflict storms at merge time; agents reverting each other's work in successive PRs. ### Rule 8 — Phased work ships in a chain Big initiatives are too large for one PR. Split into phases: - Each phase is independently shippable (passes gates, is reviewable). - Each phase is merged with `--admin` (after gates pass) before the next phase opens. - The next phase forks from fresh `main`, not from the previous phase's branch. - A tracker issue lists all phases with their status. Why merge before opening the next phase: keeping a chain of N open PRs causes catastrophic conflicts as main moves. One-at-a-time costs slightly more wall-clock; saves enormously in conflict resolution. **Failure mode prevented:** mega-PRs that cannot be reviewed; long-lived branches that conflict catastrophically with main; "phase 2 PR" that no longer applies cleanly because the assumptions of phase 1 changed. 
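The strict-serial part of Rule 8 can be expressed as a small check over the tracker issue. A minimal sketch, assuming the phase list has already been read into memory; the `Phase` shape and `nextPhaseToOpen` are hypothetical names for illustration:

```typescript
// Hypothetical model of the tracker issue; field names are illustrative.
type PhaseStatus = "planned" | "in-review" | "merged";

interface Phase {
  number: number;
  title: string;
  status: PhaseStatus;
}

// Returns the phase that may be opened next, or null when the chain is
// blocked (a phase is still in review) or finished (nothing left to open).
function nextPhaseToOpen(phases: Phase[]): Phase | null {
  const ordered = phases.slice().sort((a, b) => a.number - b.number);
  // One at a time: an unmerged phase PR blocks everything after it.
  if (ordered.some((phase) => phase.status === "in-review")) return null;
  for (const phase of ordered) {
    if (phase.status === "planned") return phase;
  }
  return null;
}
```

The check says nothing about branching; the next phase still has to fork from fresh `main` once it is allowed to open.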
## See also - [`../../templates/PR-intent.template.md`](../../templates/PR-intent.template.md) - [`pr-intent-pattern.md`](./pr-intent-pattern.md), [`merge-rules-pattern.md`](./merge-rules-pattern.md), [`tombstone-pattern.md`](./tombstone-pattern.md), [`phased-pr-pattern.md`](./phased-pr-pattern.md) - [`../ai-collaboration/concurrent-agent-pattern.md`](../ai-collaboration/concurrent-agent-pattern.md) - [`../architecture/adr-pattern.md`](../architecture/adr-pattern.md), [`../architecture/rfc-pattern.md`](../architecture/rfc-pattern.md) ==== https://playbook.agentskit.io/docs/pillars/quality --- title: 'Pillar — Quality' description: 'How to know the code works without manually reviewing every agent-produced diff.' --- # Pillar — Quality How to know the code works without manually reviewing every agent-produced diff. ## Status ◐ Scoped, not yet detailed. ## Scope | Concern | Universal principle | Concrete pattern | |---|---|---| | Test pyramid | Unit > integration > E2E; cover the boundary contracts heavily | Vitest unit + Playwright E2E + contract-level params/result parse tests | | Coverage target | >90% per shipped package, measured against statements | Per-package coverage threshold in CI; per-package, not whole-repo | | Mutation testing | Beats coverage as a quality signal once unit suite is good | Stryker / mutation tool on stable utilities first | | Hermetic tests | Component-level vitest preferred over live-app E2E | Reproduce + lock bugs via in-process tests, not Playwright | | Verify-first close | Before reproducing an issue, check if it's already fixed | Default `gh issue view \` at session start | | File-size gate | See [architecture pillar](../architecture/file-size-budget.md) | Baseline shrink-only | | Lint gates | No `any`, no `console.log`, no default exports, no nested ternaries, no raw HTML | ESLint rule pack + per-file overrides | | Quality-gates script | One `pnpm check:quality-gates` for fast structural checks | Parallel: lint + typecheck + 
secrets + size + intl + tokens | | Sanity script | One `pnpm sanity` for cross-cutting rule audit | Generates `docs/audit/sanity-report.md`; CI fails on regressions | | Pre-push hook | Runs structural gates + ADR/RFC checks; not full tests | Husky `pre-push`; tests on CI | | Concurrency safety | Agents merge PRs against fast-moving main | Stash-verify red, rebase, retry; never `--theirs`/`--ours` blindly | ## Non-negotiables 1. **Tests are part of the diff.** No "tests next PR". 2. **Coverage is per package, not aggregate.** Aggregate hides which package is bad. 3. **Hermetic over E2E for bug repro.** Component tests fail in 2s; Playwright fails in 60s and lies more. 4. **Gates produce actionable messages.** "Lint failed" is not actionable. "src/x.ts:42 — no `any` in boundary file; use `unknown` and parse." is. 5. **Pre-push is the safety net, not the proof.** Run `check:all` before a release. ## See also - [`../architecture/file-size-budget.md`](../architecture/file-size-budget.md) - [`../governance/README.md`](../governance/README.md) — PR-intent ties tests to claims. - [`../../scripts/`](../../scripts/) — gate reference impls. 
## Documents in this pillar | Doc | Read when | |---|---| | [`universal.md`](./universal.md) | First read; the 9 non-negotiables | | [`test-pyramid.md`](./test-pyramid.md) | Test-tier distribution + escalation | | [`quality-gates-pattern.md`](./quality-gates-pattern.md) | Structural gate suite + orchestrator | | [`pre-push-pattern.md`](./pre-push-pattern.md) | Three-tier hook split | | [`sanity-pattern.md`](./sanity-pattern.md) | Cross-cutting audit | | [`mutation-testing-pattern.md`](./mutation-testing-pattern.md) | Beyond coverage | | [`observability-pattern.md`](./observability-pattern.md) | Metrics / logs / traces / SLOs | | [`performance-budgets-pattern.md`](./performance-budgets-pattern.md) | Bundle / latency / resource budgets | | [`chaos-engineering-pattern.md`](./chaos-engineering-pattern.md) | Controlled fault injection | | [`ci-cd-pipeline-pattern.md`](./ci-cd-pipeline-pattern.md) | Commit → prod pipeline; caching; deploy patterns; DB migrations | | [`alerting-runbooks-pattern.md`](./alerting-runbooks-pattern.md) | SLO burn-rate alerts; runbook 5-section template; tuning loop | | [`cost-optimization-pattern.md`](./cost-optimization-pattern.md) | FinOps; per-tenant attribution; right-sizing; commitments + spot | | [`contract-testing-pattern.md`](./contract-testing-pattern.md) | Pact + schema-first; consumer-driven contracts; broker; can-i-deploy | | [`product-analytics-experimentation-pattern.md`](./product-analytics-experimentation-pattern.md) | Event tracking; funnels + cohorts; A/B experiments; holdouts | ==== https://playbook.agentskit.io/docs/pillars/quality/alerting-runbooks-pattern --- title: 'Alerting + Runbooks Pattern' description: 'How to turn observability data into a healthy alerting setup — pages that matter, runbooks that work, alert hygiene that prevents pager fatigue.' 
--- # Alerting + Runbooks Pattern How to turn observability data into a healthy alerting setup — pages that matter, runbooks that work, alert hygiene that prevents pager fatigue. ## TL;DR (human) Alert on **user-impacting** failures, not on every anomaly. Each alert has a runbook with five sections (symptoms / verify / mitigate / diagnose / resolve). Tune alerts on a cadence: aim for > 80% page-worthy ratio. Pager fatigue is the silent killer of on-call quality. ## For agents ### What deserves an alert A signal becomes an alert when **all** of these hold: 1. **User-impacting** (or will be soon). 2. **Actionable** by the on-call responder. 3. **Recoverable** within minutes-to-hours by a single responder. 4. **Confirmable** (the responder can verify the alert is real). If any condition fails, it's not an alert: - Not user-impacting → dashboard / weekly review. - Not actionable → diagnostic; not paging-worthy. - Recovery requires team effort → SEV escalation; broader response, not single page. - Cannot confirm → false-positive risk; tune the signal first. ### Alert types **Symptom alerts** (preferred): "the user-visible thing is broken". - p95 latency for journey X > Y for N minutes. - Error rate for endpoint Z > 1%. - Synthetic check for golden path failing. **Cause alerts**: "an underlying thing is failing". - DB connection pool saturation. - Disk > 90% full. - Queue depth growing > rate. Symptom alerts catch what users see. Cause alerts catch what produces those symptoms. Both are needed. Symptom alerts as the primary; cause alerts as predictive (catch before the symptom). ### Burn-rate alerting (SLO-based) Modern best practice: alert on SLO **error budget burn rate**, not on raw thresholds. ``` SLO: 99.9% over 30 days Error budget: 0.1% = 43.2 minutes of badness ``` Burn rate = current error rate / acceptable error rate. | Burn rate | Time to exhaust 30-day budget | Page? 
| |---|---|---| | 1× | 30 days | No | | 2× | 15 days | Maybe — slow burn | | 14× | 2 days | Yes (high severity) | | 100× | 7 hours | Yes (critical) | Multi-window approach (catches both fast and slow burn): - **Fast window** (5 min): burn rate > 14× over 5 min → critical alert. - **Slow window** (1 h): burn rate > 1× over 1 h → warning. This avoids both pager fatigue (slow degradation pages too often) and silent budget exhaustion (slow drift not caught). ### Per-alert anatomy Every alert config includes: ```yaml - name: payments-p95-latency-burn-rate description: Payment endpoint p95 latency violating SLO severity: SEV-1 query: | (rate(http_request_duration_seconds{handler="/api/charge",quantile="0.95"} > 0.5) > ...) windows: - 5m, threshold 14 - 1h, threshold 1.5 runbook: https://runbooks.example.com/payments-latency team: payments page_at: SEV-1 ``` Required fields: - **Name + description**: human-readable. - **Severity**: maps to paging policy. - **Query**: the actual signal. - **Windows + thresholds**: when to fire. - **Runbook URL**: link. - **Team**: routing. If you can't fill `runbook` URL, the alert isn't ready. ### Runbook structure Five sections: ```markdown # Runbook — Payments p95 latency violating SLO ## Symptoms - Alert: `payments-p95-latency-burn-rate` firing. - User-visible: checkout slow; customer reports. - Dashboard: payments p95 dashboard shows red. ## Verify (1 minute) - Open dashboard: - Confirm p95 elevation is real (not metric spike). - Check if related alerts also firing (DB, payment-provider, dependency). ## Mitigate (5 minutes) - If a recent deploy: roll back (last good revision: `git log origin/main`). - If a payment provider issue: flip kill-switch to fallback provider (see `flags-pattern.md`). - If a DB issue: route to replica (see `distributed-data-pattern.md`). ## Diagnose (15 minutes) - Trace: . - Recent deploys: . - Provider status: . - Anomalies: . ## Resolve - If root cause is in our code: ticket; fix in next PR. 
- If root cause is provider: comms to customers; track provider's resolution. - Post-mortem within 5 business days. ## Last reviewed YYYY-MM-DD by @owner. Next review: quarterly. ``` A runbook last reviewed > 6 months ago is suspect. ### Alert routing | Severity | Action | Who | Mechanism | |---|---|---|---| | SEV-1 | Page primary | On-call primary | PagerDuty / Opsgenie / Splunk On-Call | | SEV-2 | Page primary | On-call primary | Same; possibly different paging policy | | SEV-3 | Notify | Team channel | Slack / Teams | | SEV-4 | Ticket | Backlog | Jira / Linear / GitHub Issues | Per-team routing: the alert knows which team owns the affected service. ### Alert tuning loop (quarterly) For every alert in the system, ask: | Question | Action | |---|---| | Did this fire this quarter? | If never: delete (or document as latent SEV-1 sentinel) | | Was every fire actionable? | If < 80%: tune threshold / scope | | Did the runbook help? | If no: rewrite or delete the runbook | | Did the fire correlate with user impact? | If no: convert to non-paging dashboard | | Has the underlying alert query changed since last review? | If yes: re-validate threshold | Track: - **Page-worthy ratio** per alert (actionable fires / total fires). - **MTTA** (mean time to acknowledge) per alert. - **Pages per shift** (target: < 2 per week per responder; 0 is ideal). ### Alert fatigue — the cycle Too many alerts → responders ignore → real alert missed → outage → more alerts added → fatigue worsens. Symptoms: - Pages auto-acknowledged without reading. - "I'll look at it after my coffee" responses. - New team members ignored on first pages. - Outages where the alert was present but ignored. Recovery: - Delete more aggressively than you add. - Audit alerts quarterly with the team that gets paged by them. - Move medium-signal alerts to dashboards. ### Synthetic checks For golden paths (login, checkout, primary-feature-X): - Synthetic check from outside the system, every minute (or 5 min). 
- Multi-region (catches geo-localised failure). - End-to-end (not just "ping returns 200"). - Alert on N consecutive failures (avoid single-shot false positives). Synthetic checks catch what internal metrics miss — network-edge issues, DNS, CDN, third-party-provider-down. ### Service health page Per service: - Currently-firing alerts. - Recent incidents. - SLO burn rates. - Recent deploys. - Owning team + on-call contact. A single URL per service that answers "is this service okay?" without expertise. ### Alert dependencies Some alerts depend on others. The classic trap: - Service A's alert fires. - Service B (which depends on A) also alerts; redundant. - Service C (which depends on B) also alerts; double-redundant. - One incident → 7 pages. Mitigations: - **Alert grouping**: cluster related alerts into one notification. - **Dependency-aware suppression**: if A is firing, suppress dependents until A resolves. - **One service of record**: only the leaf service's alert pages; uphill services log but don't page. ### Alert blast-radius When an alert fires, what's the radius? - One service → service team paged. - Multi-service → cross-team incident; escalate to IMOC (see [`../security/on-call-rotation-pattern.md`](../security/on-call-rotation-pattern.md)). - Whole system → SEV-1; full incident response. Routing logic accounts for radius: - Single-service alert → team page. - Multi-service correlation → IMOC page. - Auth / billing / audit failure → security team + IMOC. ### Pre-alert: dashboards + logs Not every signal alerts. Three tiers: 1. **Pages**: the small set that pages someone. 2. **Dashboards**: visible during business hours; reviewed weekly. 3. **Logs / traces**: queryable on demand; not surfaced unless investigated. The split avoids both fatigue (everything pages) and silent drift (nothing surfaces). ### Common failure modes - **Alert on every error**. Pager fatigue. → Symptom alerts; SLO burn rates. - **No runbook**. Responder guesses. 
→ Mandatory runbook URL on every alert. - **Runbook says "see logs"**. Useless. → Specific queries, links, commands. - **Alert fires for stale reasons**. Half the team ignores; new joiners don't. → Quarterly tune. - **No synthetic checks**. Real users feel outage before metrics. → Synthetic for golden paths. - **Per-host alerts in cattle-not-pets infra**. One pod restart = page. → Aggregate; alert on the abnormal rate, not individual. - **No grouping**. One incident = 20 pages. → Group; suppress dependents. ### Tooling stack (typical) | Concern | Tool | |---|---| | Alert engine | Alertmanager (Prometheus), Grafana Alerting, native cloud (CloudWatch, GCP Monitoring) | | Paging | PagerDuty, Opsgenie, Splunk On-Call, VictorOps | | Synthetic checks | Datadog Synthetics, Pingdom, Checkly, native cloud | | Dashboards | Grafana, Datadog, native cloud | | SLO management | Sloth (Prometheus SLO generator), Datadog SLOs, in-house | | Runbooks | Confluence, repo-committed markdown, FireHydrant, Jeli | | Incident tracking | FireHydrant, Jeli, Incident.io | ### See also - [`observability-pattern.md`](./observability-pattern.md) — the data; alerts read from it. - [`../security/on-call-rotation-pattern.md`](../security/on-call-rotation-pattern.md) — rotation that responds to pages. - [`chaos-engineering-pattern.md`](./chaos-engineering-pattern.md) — drills test runbooks. - [`../security/secrets-leak-postmortem-playbook.md`](../security/secrets-leak-postmortem-playbook.md) — runbook example. ==== https://playbook.agentskit.io/docs/pillars/quality/chaos-engineering-pattern --- title: 'Chaos Engineering Pattern' description: 'How to find failure modes before customers do — deliberately, in a controlled way.' --- # Chaos Engineering Pattern How to find failure modes before customers do — deliberately, in a controlled way. 
## TL;DR (human)

Chaos engineering injects controlled faults (network blip, dependency timeout, region down) into a system that is observable and recoverable, and observes what happens. Done well, it surfaces hidden coupling and missing retries. Done badly, it is "production outages we caused". Start small, observe everything, increase scope only as confidence grows.

## For agents

### Preconditions

Before injecting any chaos:

1. **Observability** mature enough to see what failed and where ([`observability-pattern.md`](./observability-pattern.md)).
2. **SLOs** defined; error budget tracked.
3. **Rollback path** for any injection (turn it off immediately).
4. **Blast-radius limit**: per-injection scope is bounded (one service, one cell, one tenant; not the whole system).
5. **Stakeholder buy-in**: on-call, support, leadership know an injection is happening.

Without these, "chaos" is just "outage".

### Fault classes

| Fault | What it tests | Tooling |
|---|---|---|
| **Network latency injection** | Tolerance to slow dependencies | tc / Toxiproxy |
| **Network failure (drop)** | Retry + circuit-breaker behavior | tc / Chaos Mesh |
| **Service kill** | Failover behavior | kubectl delete pod, Chaos Monkey |
| **Disk fill** | Disk-pressure handling | Chaos Mesh |
| **CPU saturation** | Quality-of-service under load | stress-ng, Chaos Mesh |
| **Clock skew** | Time-sensitive logic | libfaketime |
| **Region down** | Multi-region failover | Cloud-provider region rollover |
| **DNS failure** | Resolver fallback | Toxiproxy, dnsmasq |
| **Database failover** | Connection-pool re-bind | Cloud-provider managed failover trigger |

### Game-day discipline

A game day is a scheduled session where the team:

1. **Hypothesises** what happens when fault X is injected.
2. **Injects** X.
3. **Observes** what actually happens.
4. **Diffs** hypothesis vs reality.
5. **Documents** + fixes any discovered gaps.

Cadence: monthly for early-stage; quarterly when mature.
A game day that finds nothing is not a failure: it confirms current resilience. A game day that finds something is high-yield.

### Start small

Order of adoption:

1. **Staging**: inject in staging first. No customer impact.
2. **Production, off-peak, one cell**: smallest blast radius.
3. **Production, business hours, one cell**: tests on-call awareness.
4. **Production, multi-cell**: largest scope; only after years of practice.

Skipping steps is how chaos engineering becomes chaos.

### Hypothesis-first

For every injection, write the hypothesis first:

```
Hypothesis: When `payment-service` becomes unresponsive for 60s:
- `/api/checkout` returns 503 with PAYMENT_UNAVAILABLE within 5s (circuit-break working).
- Affected user count rises by < 0.1%.
- `payment-service` recovers automatically within 30s of fault removal.
- No data is lost; in-flight charges either complete or are correctly rolled back.
```

After the injection: did each prediction hold? Where it didn't, document the gap.

This forces explicit reasoning about resilience instead of "let's see what breaks".

### Specific fault recipes

**Dependency timeout test**:

- Inject 30s latency on the path to a non-critical service.
- Verify: caller times out after configured timeout, falls back to default, surfaces a clear UX message, does not pile up requests.

**Failover test**:

- Kill the primary DB.
- Verify: failover triggers within RTO; replicas catch up; no data lost within RPO; reads gracefully tolerate the transition window.

**Saturation test**:

- Drive traffic to 110% of normal.
- Verify: rate limits engage; back-pressure applies cleanly; SLOs degrade gracefully (not catastrophically); error rates rise but stay actionable.

**Region failure test**:

- Block traffic to one region.
- Verify: geo-DNS routes around; affected tenants experience a brief blip; cross-region replication is still consistent post-recovery.
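
The hypothesis-first habit above can be made mechanical by writing each prediction as an upper bound and diffing it against what was observed. The following is a minimal sketch, not a prescribed format: the metric names (`checkoutErrorLatencySec`, `affectedUserPct`, `recoverySec`) and the `diffHypothesis` helper are hypothetical, and it assumes every prediction can be expressed as "observed value must not exceed a limit".

```ts
// One prediction from a written hypothesis: the observed value must not exceed `max`.
interface Prediction {
  metric: string; // hypothetical metric name
  max: number;    // upper bound written down before the injection
}

interface Gap {
  metric: string;
  predictedMax: number;
  observed: number | undefined; // undefined = the metric was never captured
}

// Return every prediction that did not hold; an empty array means the hypothesis held.
function diffHypothesis(
  predictions: Prediction[],
  observed: Partial<Record<string, number>>,
): Gap[] {
  const gaps: Gap[] = [];
  for (const p of predictions) {
    const value = observed[p.metric];
    // A missing measurement is also a gap: the game day could not confirm the prediction.
    if (value === undefined || value > p.max) {
      gaps.push({ metric: p.metric, predictedMax: p.max, observed: value });
    }
  }
  return gaps;
}

const predictions: Prediction[] = [
  { metric: "checkoutErrorLatencySec", max: 5 },
  { metric: "affectedUserPct", max: 0.1 },
  { metric: "recoverySec", max: 30 },
];

// Illustrative observations: one prediction holds, one is exceeded, one was never measured.
const gaps = diffHypothesis(predictions, {
  checkoutErrorLatencySec: 3.2,
  affectedUserPct: 0.4,
});

console.log(gaps.map((g) => g.metric).join(", ")); // affectedUserPct, recoverySec
```

Each returned gap is a candidate for the "document the gap" step; treating an unmeasured metric as a gap keeps missing observability from passing silently.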
### Continuous chaos When mature, chaos becomes continuous, not scheduled: - **Production traffic with built-in noise**: small random faults injected at all times (e.g. 1% of requests delayed 100ms). - **Pre-deploy chaos sweep**: before promoting a release, run a battery of injections on the canary. This builds resilience habit into the dev cycle: "I expect this to be reliable under noise" becomes the default mental model. ### What you learn Game days repeatedly surface: - **Missing timeouts**: a call without a timeout hangs forever; cascades. - **Missing retries**: transient failures should retry with backoff; sometimes don't. - **Retry storms**: every layer retries, multiplying load when the dependency comes back. - **Circuit-breakers absent**: a degraded dependency should be skipped; sometimes isn't. - **Coupling you didn't know about**: service A depends on service B implicitly (through a shared cache or library default). - **UX during failure**: "something went wrong" instead of "your payment is being processed, give us 30 seconds". - **Alerts that didn't fire**: the alert depended on the very system that failed. Each finding produces an ADR / RFC / fix. ### Anti-patterns - **"Chaos Monkey" without observability.** Pods die; nobody knows why. → Observability first. - **Injecting in production with no rollback.** Outage. → Always have the kill-switch. - **Hypothesis written after the fact.** Confirmation bias. → Hypothesis first. - **Injecting in customer-impacting ways without notice.** Trust erosion. → Communicate; off-peak; minimal blast radius initially. - **Single-handed chaos.** One engineer; everyone else surprised. → Team activity; everyone learns. 
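
The continuous-chaos idea in this section (a small share of requests delayed at all times) and the kill-switch rule from the anti-patterns can be combined in one small decision function. This is a sketch under stated assumptions: the `ChaosConfig` shape and the `chaosDelayMs` helper are hypothetical and are not the API of any tool named in this document. The random source is injectable so the behaviour can be tested deterministically.

```ts
interface ChaosConfig {
  enabled: boolean; // kill-switch: setting this to false stops all injection
  rate: number;     // fraction of requests to delay, e.g. 0.01 for 1%
  delayMs: number;  // artificial delay added to a selected request
}

// Decide how much artificial delay to add to a single request.
// `rand` must return a number in [0, 1); it defaults to Math.random.
function chaosDelayMs(config: ChaosConfig, rand: () => number = Math.random): number {
  if (!config.enabled) return 0; // kill-switch wins over everything else
  if (config.rate <= 0) return 0;
  return rand() < config.rate ? config.delayMs : 0;
}

const config: ChaosConfig = { enabled: true, rate: 0.01, delayMs: 100 };

console.log(chaosDelayMs(config, () => 0.005));                        // 100: draw falls under the rate
console.log(chaosDelayMs(config, () => 0.5));                          // 0: request passes untouched
console.log(chaosDelayMs({ ...config, enabled: false }, () => 0.005)); // 0: kill-switch is off
```

A request middleware would wait for the returned number of milliseconds before calling the handler. Keeping `enabled` in runtime configuration rather than in code is what makes it usable as a rollback path.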
### Tooling stack (typical) | Concern | Tool | |---|---| | Kubernetes-native chaos | Chaos Mesh, LitmusChaos | | Network proxy chaos | Toxiproxy | | Cloud provider chaos | AWS Fault Injection Service, Azure Chaos Studio | | Application-level chaos | Gremlin | | Pod / process kills | Chaos Monkey (Netflix origin), Pumba | ### Common failure modes - **Chaos in name only**: "we run game days" with no real injection. → Inject for real. - **Chaos becomes a checkbox**: scheduled but the same scenarios every quarter. → Rotate scenarios; cover new surfaces as they ship. - **No follow-up on findings**: game day surfaces issues; nothing changes. → Each finding gets an issue + owner. - **Adoption too aggressive**: production chaos before staging chaos. → Phased; trust earned. ### See also - [`observability-pattern.md`](./observability-pattern.md) — chaos without observation is malice. - [`performance-budgets-pattern.md`](./performance-budgets-pattern.md) — chaos tests budget compliance under fault. - [`../architecture/multi-region-pattern.md`](../architecture/multi-region-pattern.md) — failover drills are chaos. - [`../security/audit-ledger-pattern.md`](../security/audit-ledger-pattern.md) — chaos events themselves are audit-worthy. ==== https://playbook.agentskit.io/docs/pillars/quality/ci-cd-pipeline-pattern --- title: 'CI/CD Pipeline Pattern' description: 'How to wire continuous integration + continuous delivery so the path from commit to production is short, verifiable, reversible, and boring.' --- # CI/CD Pipeline Pattern How to wire continuous integration + continuous delivery so the path from commit to production is short, verifiable, reversible, and boring. ## TL;DR (human) Five stages: **lint/format → typecheck → unit → integration → deploy**. Each stage has a clear pass/fail signal. Caching makes warm runs fast; matrix sharding makes cold runs fast. Branch protection on main; trunk-based development. 
Deploys are reversible (one-click rollback) and progressive (canary → blue-green → ramp). Production is reached only by passing every stage; no human can manually push. ## For agents ### Pipeline stages (default order) | Stage | What runs | Typical duration | Gate? | |---|---|---|---| | **Lint + format** | ESLint, Prettier, framework lints | < 1 min | Hard block | | **Typecheck** | tsc -b, mypy, etc. | 1–3 min | Hard block | | **Structural gates** | file-size, no-any, named-exports, etc. (`quality-gates`) | < 30 s | Hard block | | **Unit + contract tests** | tier 1–2 from test pyramid | 1–5 min | Hard block | | **Integration tests** | tier 3 (in-process, no external services) | 2–10 min | Hard block | | **Build** | bundle + artifact creation | 1–5 min | Hard block | | **Bundle-size gate** | per-route size budget | < 30 s | Hard block | | **E2E (tier 5)** | smoke + golden paths only | 5–15 min | Hard block | | **Security scan** | SAST + dependency CVE scan | 1–5 min | Hard block (critical/high) | | **Deploy to staging** | automatic on green main | 2–10 min | n/a | | **Smoke against staging** | tier 5 against real staging | 2–5 min | Hard block before prod | | **Deploy to production** | canary first | progressive | Manual approval (or auto, per policy) | Total budget target: **< 20 minutes** from PR open to merge-ready signal. Past that, agents and humans context-switch and the loop breaks. ### Caching discipline Caches deliver most of the speedup: - **Package manager cache** (`pnpm store`, `node_modules`): keyed by lock-file hash. - **Build cache** (Turbo / Nx / Rush): keyed by inputs (source + deps). - **Test cache**: per-package; skip unchanged packages. - **Docker layer cache**: ordered for max hit rate (deps before source). Cold-cache run remains as the worst-case bound; design pipelines so warm-cache is the common case. ### Matrix sharding When the test suite is large, shard: - 4–8 parallel shards is the sweet spot. 
- Each shard runs a roughly equal slice (balanced by historical duration, not alphabetically).
- Tools: Vitest projects, Jest `--shard`, Playwright shards, pytest-xdist.

Result aggregation: each shard reports pass/fail; one aggregate signal goes back to the PR.

### Branch protection (mandatory)

Main branch rules:

- Direct pushes blocked (everyone goes through PR).
- Required status checks: lint, typecheck, structural, unit, integration, build, security.
- Required reviewers: ≥ 1 human (or per your policy).
- Linear history (rebase merge); no merge commits unless explicitly allowed.
- Force-push blocked.

Per-branch policy: long-lived branches (`develop`, `epic/*`) have similar rules; ephemeral feature branches don't need them.

### Trunk-based development

The recommended model:

- Main is always shippable.
- Feature branches are short-lived (hours to days).
- Large changes go behind feature flags (per [`../architecture/feature-flags-pattern.md`](../architecture/feature-flags-pattern.md)) instead of long-lived branches.
- No `develop` / `staging` / `qa` branches that diverge.

Why: long-lived branches accumulate conflicts; short-lived branches keep concurrent-agent coordination tractable.

### Deploy patterns

**Canary**: deploy to 1% of traffic; observe SLOs for N minutes; promote on pass.

**Blue-green**: full new stack stands up; traffic flips when ready; old stack stays for instant rollback.

**Rolling**: incremental replacement of instances; cheaper but slower rollback.

**Feature-flag ramp**: code deployed everywhere; flag controls user-visible rollout (0% → 1% → 10% → 100%).

Most products combine blue-green with feature flags and use canary as a pre-ramp validation.

### Rollback discipline

Every deploy is reversible. One-click. Within 5 minutes.

Mechanisms:

- **Re-deploy previous version**: previous artifact is retained for N versions.
- **Feature-flag flip**: turn off the new behavior (faster than redeploy).
- **DB migration backward-compat**: every migration must support N and N-1 application versions concurrently. No destructive migrations without a 2-phase deprecation (add new → backfill → switch → drop old).

A deploy you cannot roll back is a one-way door. Avoid; if unavoidable, treat as RFC.

### Database migration discipline

Migrations are the trickiest CI/CD territory:

- **Forward-only** in production. Roll-back via roll-forward (a new migration that reverses).
- **Backward-compat**: app version N and N+1 must both work with schema version M during deploy.
- **Two-phase column changes** (expand first, contract later), run as five steps:
  1. Add new column (deployable any time).
  2. Backfill data.
  3. Code reads from new, writes to both.
  4. Code reads + writes new only.
  5. Drop old column (only once no deployed version still reads or writes it).

This sequence enables rollback at every step before the final drop. Single-step "drop column" mid-deploy is the canonical post-mortem.

### Tier-by-tier signal

Each stage exits with a clear signal:

- **0**: pass.
- **1**: fail (rule was clear; fix it).
- **2**: error (the gate itself crashed; investigate).

Aggregated dashboard shows: which stage, which check, when, with link to logs. Agents triaging a red build go to the specific failed gate, not "the build failed".

### Caching gotchas

- **Cache poisoning**: a bad build polluted the cache; next builds use it; fail mysteriously. → Cache key includes a salt / version stamp; bumped on infrastructure change.
- **Stale cache masks real failures**: tests pass against cached artifacts; real artifact would fail. → Periodic cold-cache runs (nightly).
- **Cache size growth**: cache costs more than the compute savings. → TTL on cache entries; max-size enforced.

### Pre-merge vs post-merge

Two strategies for placing expensive checks:

**Pre-merge**: every PR runs full pipeline. Slow PRs; high confidence at merge.

**Post-merge with revert**: PRs run fast subset (lint + typecheck + unit); merge fast; full suite runs post-merge; if red, revert.

Pre-merge suits products with low merge frequency.
Post-merge with auto-revert works for very high frequency. ### Auto-merge When the queue has 10+ PRs waiting: - Configure auto-merge: when checks pass + reviewer approved, merge automatically. - Merge queue (GitHub native, or Mergify): orders merges; each tested against the queued tip. Without a merge queue, fast-merge PRs invalidate each other's CI status. ### Build artifacts What you ship is what you tested. Artifact discipline: - Build once at the start of the pipeline. - All subsequent stages test that artifact. - Same artifact promotes through staging → production. - Never rebuild between stages. Rebuilding between stages introduces drift; tests pass; production differs. ### Secrets in CI Secrets reach CI via: - **Built-in secret store** (GitHub Actions secrets, GitLab CI variables, etc.). - **OIDC federation** to cloud provider (preferred): CI proves identity; cloud grants temporary credentials; no long-lived secrets in CI. - **Vault integration**: CI fetches from your vault. Discipline: - Secrets per environment (staging vs prod). - Per-job scope (job that doesn't need a secret can't read it). - Rotation per [`../security/vault-pattern.md`](../security/vault-pattern.md). - Audit-log secret access. ### Build provenance Per [`../security/vulnerability-mgmt-pattern.md`](../security/vulnerability-mgmt-pattern.md): - Sign build artifacts (cosign). - Attest the build process (SLSA framework). - Consumers verify chain. Provenance attaches CI → commit → artifact, traceably. ### Per-pillar concerns at CI/CD scope **Security**: SAST scan in pipeline; SBOM generated; secrets scan; vulnerability triage. **Quality**: gate suite (structural + unit + integration); coverage thresholds; bundle budgets. **UI-UX**: a11y axe scan; visual regression; intl parity. **Architecture**: schema diff (RFC enforcement); ADR/RFC integrity. **Governance**: PR-intent gate; required reviews; changeset present. **AI-collaboration**: prompts updated when CI conventions change. 
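
The build-once rule from the Build artifacts section can be checked mechanically: record the artifact's digest when it is built and refuse to promote anything whose bytes differ. Below is a minimal sketch using Node's built-in `node:crypto` module. The `artifactDigest` and `canPromote` helpers are illustrative names, and the in-memory byte arrays stand in for real artifact files; in practice the digest would more likely come from a registry or a signed attestation.

```ts
import { createHash } from "node:crypto";

// SHA-256 digest of an artifact's bytes, hex-encoded.
function artifactDigest(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Promote only if the candidate is byte-identical to the artifact the pipeline tested.
function canPromote(testedDigest: string, candidate: Uint8Array): boolean {
  return artifactDigest(candidate) === testedDigest;
}

const encoder = new TextEncoder();

// Stand-in for the artifact produced once at the start of the pipeline.
const built = encoder.encode("bundle-v1");
const testedDigest = artifactDigest(built); // recorded at build time

const sameArtifact = encoder.encode("bundle-v1");
const rebuilt = encoder.encode("bundle-v1 (rebuilt)"); // simulates a rebuild between stages

console.log(canPromote(testedDigest, sameArtifact)); // true
console.log(canPromote(testedDigest, rebuilt));      // false
```

A digest comparison only shows that the bytes are unchanged; signing and attestation (as described under Build provenance) are what tie those bytes back to a commit and a build process.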
### Common failure modes - **20-minute pipeline.** Agents context-switch. → Cache; shard; ruthlessly cut. - **Flaky tests**. PRs retry, eventually pass. Real failures hide. → Track flake rate; quarantine; fix or delete. - **Long-lived `develop` branch**. Diverges from main. → Trunk-based; flags for in-flight. - **Manual deploy step**. Toil; mistakes. → Automate end-to-end. - **Rollback "is just redeploying old version"**. Slow. → Feature flag flip; blue-green flip. - **Single environment**. No staging. → At minimum staging that resembles prod. - **No build provenance**. Cannot prove what came from where. → Sign + attest. - **Cache poisoning hidden**. Bad cache survives. → Nightly cold runs. ### Tooling stack (typical) | Concern | Tool | |---|---| | CI runner | GitHub Actions, GitLab CI, CircleCI, Buildkite | | Build cache | Turborepo, Nx, Bazel | | Container registry | GHCR, ECR, GCR | | Deploy | ArgoCD, Spinnaker, Flux, custom | | Feature flags | Unleash, LaunchDarkly, Flagsmith, in-house | | DB migrations | Atlas, Liquibase, Flyway, framework-native | | Secrets in CI | OIDC + cloud KMS, GitHub Secrets, Vault | | Provenance | sigstore, SLSA | | Merge queue | GitHub merge queue, Mergify | ### See also - [`pre-push-pattern.md`](./pre-push-pattern.md) — the local-side counterpart. - [`quality-gates-pattern.md`](./quality-gates-pattern.md) — gates that the pipeline runs. - [`../architecture/feature-flags-pattern.md`](../architecture/feature-flags-pattern.md) — flags decouple deploy from release. - [`../architecture/api-versioning-pattern.md`](../architecture/api-versioning-pattern.md) — DB migration backwards-compat. - [`../security/vulnerability-mgmt-pattern.md`](../security/vulnerability-mgmt-pattern.md) — SBOM + provenance. - [`../phases/05-ship/README.md`](../../phases/05-ship/README.md) — release-gate checklist. 
==== https://playbook.agentskit.io/docs/pillars/quality/contract-testing-pattern
---
title: 'Contract Testing Pattern'
description: 'How to verify that consumer + provider agree on a contract, without expensive end-to-end tests.'
---

# Contract Testing Pattern

How to verify that consumer + provider agree on a contract, without expensive end-to-end tests.

## TL;DR (human)

Contract tests sit between unit tests (in-process) and E2E (real services). They verify that "consumer A's expectations of provider B" match "provider B's actual behavior", independently and without running both at the same time. Pact is the canonical tool; consumer-driven contracts the canonical methodology.

## For agents

### What a contract test is

A contract is the shape of a request + response between two services. A contract test:

1. **Consumer** records its expectations as a pact file (which methods called, what params sent, what response expected).
2. **Provider** runs the pact file as a test (replays the expected requests; asserts responses match).

If both pass, the contract holds. Consumer + provider can deploy independently.

### Consumer-driven vs provider-driven

| Style | Driver | Use case |
|---|---|---|
| **Consumer-driven** | Consumer writes the contract; provider verifies | Many consumers; provider's job is to keep them all happy |
| **Provider-driven** | Provider publishes a spec (OpenAPI); consumer verifies | One provider; many consumers; provider sets the shape |

Consumer-driven (Pact) is more common in microservice meshes where a provider has many consumers and changes affect all.

### Where contract tests fit

| Test type | Speed | What it tests | Hits real network? |
|---|---|---|---|
| **Unit** | <10ms | Pure logic | No |
| **Contract** | 100ms-1s | Inter-service shape agreement | No (replayed) |
| **Integration** | seconds | Multiple in-process components | Sometimes |
| **E2E** | minutes | Whole system | Yes |

Contract tests **replace many integration / E2E tests** that exist only to verify "does service A talk to service B". Those are slow and flaky; contract tests are fast and deterministic.

### Pact mechanics

**Consumer side**:

```ts
// consumer.test.ts
import path from "node:path";
import { Pact } from "@pact-foundation/pact";
// Consumer code under test; the module path is illustrative.
import { fetchUser } from "./fetch-user";

// Older V2-style `Pact` DSL, as in the original snippet; check which DSL
// your installed pact-js version exports under this name.
const pact = new Pact({
  provider: "user-service",
  consumer: "checkout-service",
  dir: path.resolve(process.cwd(), "pacts"), // where the pact file is written
});

beforeAll(() => pact.setup());   // start the local mock provider
afterEach(() => pact.verify());  // fail if a registered interaction was not exercised
afterAll(() => pact.finalize()); // write the pact file and stop the mock

it("fetches user by id", async () => {
  await pact.addInteraction({
    state: "user 42 exists",
    uponReceiving: "a request for user 42",
    withRequest: { method: "GET", path: "/users/42" },
    willRespondWith: {
      status: 200,
      body: { id: 42, email: "x@y.com", role: "member" },
    },
  });

  // Test the consumer code against the mock provider's URL.
  const user = await fetchUser(pact.mockService.baseUrl, 42);
  expect(user.id).toBe(42);
});
```

Pact file emitted: a JSON record of expectations.

**Provider side**:

```ts
// provider.test.ts
import { Verifier } from "@pact-foundation/pact";
// Project-specific fixture helper (not part of Pact); the module path is illustrative.
import { seedUser } from "./test-fixtures";

it("honors all consumer pacts", async () => {
  await new Verifier({
    provider: "user-service",
    providerBaseUrl: "http://localhost:3000", // real provider, already running
    pactUrls: ["./pacts/checkout-service-user-service.json"],
    // Each handler puts the provider into the state named by the consumer's interaction.
    stateHandlers: {
      "user 42 exists": async () => {
        await seedUser(42);
      },
    },
  }).verifyProvider();
});
```

Provider replays each interaction; asserts response matches.

### Pact broker

A shared service stores pact files between consumer + provider CI:

- Consumer publishes after passing.
- Provider fetches + verifies.
- Provider publishes its current state ("verified vN against consumer X").
- A "can-i-deploy" check verifies a new version doesn't break any consumer.

Brokers: Pactflow (hosted), Pact Broker (self-hosted).

### Schema-first alternative

If you already have an OpenAPI / Protobuf / GraphQL schema:

- Provider tests verify implementation matches schema.
- Consumer tests verify consumed shape matches schema. - A schema diff in CI catches breaks. Cheaper than Pact if you have the schema discipline. Pact wins when consumer-side expectations are richer than the schema (e.g. specific field combinations). ### When NOT contract testing - Single-service systems (no cross-service contracts). - Public APIs with thousands of unknown consumers (you can't get pacts; use schema versioning + telemetry). - Internal-only RPC where the contract package itself is the source of truth (per [`../architecture/contracts-zod-pattern.md`](../architecture/contracts-zod-pattern.md)) — Zod schemas + dispatcher tests do most of the same work. ### What contract tests catch that unit tests don't - Provider deployed with a field rename; consumer reads the old field; unit tests pass on both sides; production breaks. - Provider tightened validation; consumer sends payloads that now reject; unit tests pass. - Consumer started sending a new field the provider mis-parses. Each surfaces only at the integration point. Unit tests on each side individually wouldn't catch. ### Failure modes contract tests don't catch - Provider returns wrong **values** (correct shape, wrong content). → Integration tests. - Network behavior (timeouts, retries, partial failures). → Chaos tests. - Authentication / authorization correctness. → Security tests + integration. - Performance regressions. → Load tests. Contract tests verify **shape agreement**. They are not a substitute for other testing. ### Discipline - One pact file per consumer-provider pair. - Pact files committed to consumer's repo; published to broker on green CI. - Provider verification in CI; "can-i-deploy" gate before promote. - Pact file changes require consumer-team approval (changing expectations = changing contract). ### Anti-patterns - **Pact mocks the provider in production tests**. Mock testing the mock. → Real provider for end-to-end; pacts for the contract. 
- **Consumer pact tests pass without round-trip**. Consumer expects field X; provider doesn't return X; pact says "OK" because pact only asserts what's specified. → Strict matching mode. - **Pact files diverge from real provider over time**. → Broker + can-i-deploy gates. - **No state handlers**. Pact assumes data exists; provider doesn't have it. → State setup per interaction. ### Common failure modes - **Contract tests but no broker**. Pacts on individual machines; provider has no idea. → Set up Pact broker; CI publishes. - **No can-i-deploy gate**. Provider ships breaking change; consumers break. → Gate at deploy time. - **Pacts cover happy path only**. Error paths not contracted. → Cover errors too (the provider returns this code in this situation). - **Per-consumer differences not enforced**. Consumer A wants strict; consumer B wants lax. → One per consumer-provider pair. ### Tooling stack (typical) | Concern | Tool | |---|---| | Contract testing | Pact, Spring Cloud Contract | | Broker | Pactflow (hosted), Pact Broker (open source) | | Schema-first | OpenAPI + Schemathesis / Dredd, GraphQL Inspector | | gRPC | protobuf compatibility tooling (buf) | ### Adoption path 1. **Day 0**: schema-first (OpenAPI / Zod / Protobuf) covers most needs. 2. **Microservice ≥ 3**: introduce Pact for the most-contended consumer-provider pairs. 3. **Microservice ≥ 10**: broker; can-i-deploy gates. 4. **Many consumers**: consumer-driven Pact across the mesh. Don't adopt Pact for a monolith. Wait for genuine consumer-provider distance. ### See also - [`../architecture/contracts-zod-pattern.md`](../architecture/contracts-zod-pattern.md) — single-repo contract discipline. - [`../architecture/api-versioning-pattern.md`](../architecture/api-versioning-pattern.md) — schema evolution rules. - [`test-pyramid.md`](./test-pyramid.md) — where contract tests fit. - [`ci-cd-pipeline-pattern.md`](./ci-cd-pipeline-pattern.md) — can-i-deploy gate in pipeline. 
==== https://playbook.agentskit.io/docs/pillars/quality/cost-optimization-pattern --- title: 'Cost Optimization Pattern (FinOps)' description: 'How to control cloud spend without micromanaging every commit.' --- # Cost Optimization Pattern (FinOps) How to control cloud spend without micromanaging every commit. ## TL;DR (human) Cloud spend without FinOps doubles every 18 months on autopilot. Discipline: per-team budget; per-tenant attribution; per-workload right-sizing; commitments + spot for predictable load; caching + query budgets; CI runs cost-aware too. The goal is "spend roughly what we said we'd spend" — not minimise at all cost. ## For agents ### Three FinOps phases (the FinOps Foundation framework) | Phase | Question | Tools | |---|---|---| | **Inform** | Where is the money going? | Cost dashboards; per-service / per-team / per-tenant attribution | | **Optimize** | What can we cut without harm? | Right-sizing; commitments; spot; cache; query reduction | | **Operate** | How do we keep it that way? | Budgets; alerts; per-PR cost gates; FinOps rituals | Most teams skip straight to Optimize; that's wrong. Inform first; without attribution, optimization is guesswork. ### Inform — cost attribution Every dollar should answer: - **Service**: which microservice / Lambda / managed service. - **Environment**: prod / staging / dev. - **Team**: who owns it; who reviews bills. - **Tenant** (multi-tenant systems): which customer drives the cost. - **Feature / surface** (optional but powerful): which product surface. Achieved via: - **Cloud tags / labels**: applied to every resource at creation time; enforced via IaC (Terraform, Pulumi, CDK). - **Per-request tagging**: spans / logs / metrics carry tenant + service tags. - **Cost allocation reports**: cloud-native (AWS Cost Explorer, GCP Billing, Azure Cost Management) + per-tenant rollup. Untagged resources = mystery costs. Hard rule: no untagged resources in production. 
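
The "no untagged resources" rule above lends itself to a gate over the resources planned in an IaC change. The following is a minimal sketch, assuming a hypothetical `Resource` shape and the tag keys `service`, `environment` and `team` taken from the attribution list; a real gate would read the plan output of whichever IaC tool is in use.

```ts
interface Resource {
  id: string;
  tags: Record<string, string>;
}

interface TagViolation {
  id: string;
  missing: string[];
}

// Tag keys assumed mandatory for this sketch; adjust to your attribution scheme.
const REQUIRED_TAGS = ["service", "environment", "team"];

// Return one violation per resource that lacks any required tag.
function findTagViolations(
  resources: Resource[],
  required: string[] = REQUIRED_TAGS,
): TagViolation[] {
  return resources
    .map((r) => ({
      id: r.id,
      // Absent, empty, or whitespace-only values all count as missing.
      missing: required.filter((key) => !(r.tags[key] ?? "").trim()),
    }))
    .filter((v) => v.missing.length > 0);
}

const violations = findTagViolations([
  { id: "bucket-logs", tags: { service: "ingest", environment: "prod", team: "data" } },
  { id: "queue-tmp", tags: { service: "ingest", environment: "" } },
]);

// One violation: queue-tmp is missing "environment" and "team".
console.log(JSON.stringify(violations));
```

Run as a CI step, a non-empty result would fail the check, which keeps mystery costs from entering production in the first place.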
### Per-tenant attribution In multi-tenant SaaS, per-tenant cost drives: - **Pricing**: usage-based or tier-based pricing depends on cost knowledge. - **Customer success**: tenants spending 10× more than they pay are flight risks (acquisition cost will outweigh). - **Capacity planning**: who would grow + how much. - **Quota tuning**: where to put limits. Computed from observability tags (per [`observability-pattern.md`](./observability-pattern.md)). Roll up nightly into a per-tenant cost table. ### Right-sizing Most workloads over-provision. Symptoms: - CPU steady < 30%. - Memory steady < 50%. - Network rarely saturated. Right-sizing process: 1. Measure: 30+ days of utilisation per instance / service. 2. Recommend: smaller instance class; lower memory; fewer replicas. 3. Stage: change in staging; measure. 4. Promote: change in production with rollback path. Hold a reasonable cushion (50–70% utilisation steady-state; lower for spiky workloads). Auto-scaling helps but only when: - Cold-start is acceptable (sub-minute scale-up). - Stateless workload. - Predictable load shape. ### Commitments + spot Cloud providers reward predictability: - **Reserved instances / commitments** (1y, 3y): 30–70% off list price. - **Spot instances**: 60–90% off; reclaimed on short notice. Mix: - **Steady baseline**: covered by commitments (~70% of capacity). - **Burst above baseline**: spot or on-demand. - **Stateful / critical**: on-demand or reserved; never spot. Commitments are a forecasting bet. Under-commit and miss the discount; over-commit and pay for unused capacity. Default to under-committing. ### Query + storage budgets In data-heavy systems, database cost often dominates compute. Per-endpoint discipline (extends [`performance-budgets-pattern.md`](./performance-budgets-pattern.md)): - **Query count budget** per request (N+1 detection: > 20 = probable bug). - **Bytes scanned budget** per request (avoid full-table scans). 
- **Result size budget** (paginate everything; max 100 rows per page default). - **Cold storage tier** for data > 90 days unused. - **Compression**: enable everywhere it pays (most columnar stores). Per-tenant query budgets prevent noisy-neighbor cost spikes: - Max query CPU-time per tenant per minute. - Max bytes scanned per tenant per hour. - Circuit-break at limit; surface as `QUOTA_EXCEEDED` (see [`../security/multi-tenant-isolation-pattern.md`](../security/multi-tenant-isolation-pattern.md)). ### Egress + data transfer Often the surprise on cloud bills: - Cross-region transfer: usually expensive. - Egress to internet: expensive at scale. - Cross-AZ transfer: sometimes free, sometimes not. Mitigations: - Keep data hot in the same region / AZ as the consumer. - CDN for public assets (one-time push; cheap edge serving). - Avoid cross-region replication for non-critical data. - Compress on the wire. ### Background jobs + queues Cheaper than synchronous: - Async background jobs scale on cheaper compute (spot OK). - Queues buffer bursts; smooth provisioning. - Job retries are cheap when the work is idempotent. Discipline: - Idempotency-key on every job (replays don't double-charge). - Dead-letter queue for permanently failing jobs. - Visibility into queue depth + worker utilisation. ### Cache hit rate A cache pays for itself when: - Hit rate > 60% in steady state. - Origin compute / DB cost per request > cache cost per request. Measure cache hit rate per cache; track over time. Hit rate drops are signals (key churn, app pattern change, eviction pressure). See [`../architecture/distributed-data-pattern.md`](../architecture/distributed-data-pattern.md) for cache tiers. ### Dev / CI cost Often overlooked: - **CI minutes**: cache aggressively (per [`ci-cd-pipeline-pattern.md`](./ci-cd-pipeline-pattern.md)). - **Per-PR ephemeral environments**: convenient but expensive; lifecycle them (auto-tear-down after N days). 
- **Dev databases**: long-running instances; sleep / terminate on no-activity. - **Build artefact storage**: tier old artefacts to cold; expire after N versions. A 10× cost gap exists between "every team has its own everything 24/7" and "shared dev infrastructure with lifecycle policies". ### Cost gates Beyond budgets, per-PR cost signals: - **Bundle size increase** → bandwidth + CDN cost. - **New cloud resources** in IaC diff → manual review. - **New paid service dependency** → PR comment with monthly estimate. - **Performance regression** → potentially higher per-request cost. These extend the gate suite (see [`quality-gates-pattern.md`](./quality-gates-pattern.md)). ### FinOps rituals | Cadence | Activity | |---|---| | Daily | Cost-anomaly alerts trigger on spike | | Weekly | Top spenders dashboard reviewed | | Monthly | Per-service / per-team budgets reviewed | | Quarterly | Commitments + right-sizing review | | Annually | Cloud-provider contract negotiation; multi-cloud strategy review | ### Anomaly detection A 2× cost spike in 24h means something changed. Possible causes: - New deploy with a query regression. - A customer's traffic spike (good or bad). - A bug producing infinite retries. - A test inadvertently shipped that hits an expensive path. - Account compromise (cryptominer; spam). Anomaly alert routes to the team that owns the service. SEV depends on magnitude: - 1.5× = warn. - 3× = SEV-3. - 10× = SEV-1 (probable runaway or compromise). ### Cost-per-X metrics Useful tracking metrics: - **Cost per active user** (DAU / MAU). - **Cost per request**. - **Cost per tenant** (per pricing-tier). - **Cost per transaction** (for products with discrete units of value). Surface these in dashboards alongside business metrics. Engineering leadership reasons about cost in product terms. ### Multi-cloud caveat Multi-cloud is sometimes proposed for cost. Reality: - Cross-cloud egress is expensive; data gravity locks you in. 
- Operations cost (per-cloud expertise, tooling) often exceeds savings. - Single-cloud + multi-region usually delivers most of the resilience. Adopt multi-cloud for sovereignty / vendor risk / specific service needs — not for cost. ### Common failure modes - **No tagging**. Cost mystery; cannot attribute. → Enforce in IaC. - **Over-provisioning on autopilot**. Auto-scaling configured min ≥ peak; never scales down. → Right-size min. - **No budgets**. Bills surprise leadership. → Per-team budget; alerts at 50/75/90%. - **Cost work treated as separate from engineering**. → Engineers see their cost; act on it. - **Optimisations that break SLOs**. Cost down; quality down. → Cost is constrained-by SLO, not above it. - **Spot for stateful**. Reclaimed mid-write. → On-demand for state; spot for stateless. - **Cross-region traffic accidental**. Service in region A calls DB in region B. → Network policy + alert. ### Tooling stack (typical) | Concern | Tool | |---|---| | Cloud-native cost | AWS Cost Explorer + Budgets, GCP Billing, Azure Cost Management | | Per-resource analysis | Vantage, Infracost, CloudHealth, Spot.io | | Per-Kubernetes pod cost | Kubecost, OpenCost | | Anomaly detection | CloudZero, native cloud anomaly alerts | | Per-tenant attribution | In-house roll-up from observability tags | | Right-sizing recommendations | AWS Compute Optimizer, GCP Recommender | | FinOps governance | FinOps Foundation framework | ### Adoption path 1. **Day 0**: tag everything; one budget per environment. 2. **Month 1**: per-service attribution; top-spenders dashboard. 3. **Quarter 1**: right-sizing review; first commitments / reservations. 4. **Quarter 2**: per-tenant attribution; cost-per-X dashboards. 5. **Quarter 3+**: PR-level cost signals; cost-aware feature design. 6. **Mature**: FinOps team / role; engineering OKRs include cost. ### See also - [`performance-budgets-pattern.md`](./performance-budgets-pattern.md) — performance and cost overlap heavily. 
- [`observability-pattern.md`](./observability-pattern.md) — tags drive cost attribution. - [`../security/multi-tenant-isolation-pattern.md`](../security/multi-tenant-isolation-pattern.md) — per-tenant quotas. - [`../architecture/distributed-data-pattern.md`](../architecture/distributed-data-pattern.md) — cache + replica costs. - [`ci-cd-pipeline-pattern.md`](./ci-cd-pipeline-pattern.md) — CI cost. ==== https://playbook.agentskit.io/docs/pillars/quality/mutation-testing-pattern --- title: 'Mutation Testing Pattern' description: 'How to score whether your unit tests actually catch bugs, beyond what coverage tells you.' --- # Mutation Testing Pattern How to score whether your unit tests actually catch bugs, beyond what coverage tells you. ## TL;DR (human) After the unit suite stabilises, run a mutation tool (Stryker for JS/TS, mutmut for Python, similar for other langs). It introduces small bugs ("mutants") into the source and re-runs the tests; surviving mutants reveal tests that pass on bad code. Kill survivors by adding the missing assertion. Use on stable utility modules first; not the whole repo. ## For agents ### Why mutation Coverage tells you which lines ran. It does not tell you whether the test would catch a bug in those lines. Example: a test that calls a function and never asserts on the return value has 100% coverage of the function's lines but catches zero bugs in them. Mutation testing surfaces this. Mutation introduces typed bugs (mutants): - `>` becomes `>=` - `+` becomes `-` - `if (x)` becomes `if (!x)` - `return foo` becomes `return null` - a string literal changes - a function body is replaced with `return undefined` Each mutant is then evaluated: does the test suite catch it? If yes, **killed**. If no, **survived** — a real gap. ### When to introduce mutation Not on day one. Mutation is expensive (runtime: minutes to hours) and produces noise on a young codebase. Introduce when: - Unit suite is stable (passes consistently, no flakes). 
- Coverage is already high (≥ 80% per package). - The code under test is *production-critical* — security, billing, audit, contracts. ### Scope, not whole-repo Run mutation on one package or one module at a time. Whole-repo mutation runs are usually impractical (hours of runtime; result fatigue). Pick targets in this order: 1. Contract / schema packages. 2. Error-model code. 3. Auth / security guards. 4. Billing / cost calculations. 5. Audit ledger / append-only stores. UI components are usually not worth mutating — their behavior is verified by E2E and visual regression at lower cost. ### Reading the report Output: a mutation score (killed / total) per file + the surviving mutants with diff snippets. Interpret: - **High score (≥ 80%)**: tests catch most bugs in this code. Good. - **Low score (< 60%)**: tests run the code but do not assert on its behavior. Add assertions. - **Survivors clustered in one function**: that function is undertested. Add targeted tests. - **Survivors at error paths**: the error-path tests don't assert on the error code. See [`universal.md`](./universal.md) Rule 4. - **Equivalent mutants** (mutants that produce identical observable behavior): cannot be killed by definition. Mark and move on. ### Killing survivors For each surviving mutant: 1. Read the diff. Understand the bug. 2. Identify the test that should have caught it. 3. Add the missing assertion. Often: assert on the *return value*, not just that the function was called. 4. Re-run mutation; confirm killed. Do not add tests *to kill the mutant for its own sake*. The goal is "the test now asserts on real behavior that matters". A test added solely to kill a mutant, with no real behavioral claim, is noise. ### Equivalent mutants Some mutants are semantically equivalent to the original. Example: `const a = b; return a` → `return b`. They cannot be killed by any test. The mutation tool may flag many of these. Maintain an allowlist file mapping `\:\: \` → ignored. 
Treat allowlist growth as a code smell — sometimes the code itself can be simplified to avoid the equivalence. ### Performance discipline - Cache mutation results between runs where source has not changed. - Run mutation on changed files in CI, full sweep nightly / weekly. - Mutation does not gate PRs; it gates *releases* — fail release if mutation score regressed > N%. ### Common failure modes - **Mutation on day one.** Score is meaningless because the suite is incomplete. → Stabilise the suite first. - **Whole-repo mutation.** Run takes 6 hours; report is fatigue. → Scope. - **"Killing the mutant" instead of "asserting on real behavior".** Adds noise. → If the kill requires a contorted assertion, the bug the mutant simulates is probably not worth catching. - **Equivalent mutants treated as real survivors.** Inflated score. → Allowlist, with reason. - **Mutation report nobody reads.** Score drifts down. → Make the score visible in the release-gate report. ### Tools by language | Language | Tool | |---|---| | JS / TS | Stryker, StrykerJS | | Python | mutmut, cosmic-ray | | Java / Kotlin | PIT (PITest) | | C# | Stryker.NET | | Rust | cargo-mutants | | Go | go-mutesting | ### See also - [`universal.md`](./universal.md) — Rule 2 (per-package coverage), Rule 4 (codes not messages). - [`test-pyramid.md`](./test-pyramid.md) — coverage as a precondition. - [`sanity-pattern.md`](./sanity-pattern.md) — mutation score is a sanity metric. ==== https://playbook.agentskit.io/docs/pillars/quality/observability-pattern --- title: 'Observability Pattern' description: 'How to know what the system is doing in production, beyond ''tests passed''.' --- # Observability Pattern How to know what the system is doing in production, beyond "tests passed". ## TL;DR (human) Three signals: **metrics** (counters/gauges/histograms, low cardinality), **logs** (events, structured, queryable), **traces** (request spans across services). 
Define SLOs (Service-Level Objectives) that capture user-perceived correctness; alert on SLO burn rate, not on noisy thresholds. Per-tenant attribution is mandatory in multi-tenant systems. ## For agents ### The three signals | Signal | Question it answers | Storage shape | Volume | |---|---|---|---| | **Metrics** | "How is the system trending?" | Time series, aggregated | Low (cardinality controlled) | | **Logs** | "What exactly happened on this request?" | Structured events | High (sampled / retained per class) | | **Traces** | "Where did the time / failure go in this request?" | Spans + dependencies | Medium (often sampled) | You need all three. Each answers questions the others cannot. ### Metrics — what to collect **Per service, default set**: - **RED**: Rate (requests/s), Errors (errors/s), Duration (latency histogram). - **USE**: Utilization (CPU/mem/IO %), Saturation (queue depth), Errors (system-level). **Per business event**: counter per meaningful product event (`user.invited`, `flow.executed`, `payment.charged`). These drive product metrics + sanity dashboards. **Cardinality discipline**: metric labels should be low-cardinality. `service` + `method` + `status` is fine. `user_id` as a label is fatal — every user explodes the metric count. Use logs / traces for high-cardinality dimensions. ### Logs — what to log Structured JSON, not free-form strings: ```ts logger.info("user.invited", { workspaceId, // multi-tenant attribution inviterId, inviteeEmail: "", // PII redacted requestId, durationMs: 47, }); ``` **Required fields on every log**: - `level` (info / warn / error / debug). - `tag` (the source component). - `timestamp` (ISO-8601 UTC). - `requestId` (correlates with traces). - `workspaceId` / `tenantId` (multi-tenant attribution). **What to log** (good signal): - Boundary crossings (request enters / exits). - Business events (the named events above). - Recoverable errors (with `cause`). - State transitions. 
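The required fields above could be enforced with a thin wrapper, so that no call site can forget them. The following is a minimal sketch, not the playbook's logger: `LogContext`, `LogRecord`, `REDACTED_KEYS`, `redact` and `buildLogRecord` are hypothetical names, and the exact-match key list is an assumed redaction policy.

```ts
// Hypothetical sketch: every name here is illustrative, not a playbook API.
type LogLevel = "debug" | "info" | "warn" | "error";

// Fields the caller must supply on every log line.
interface LogContext {
  tag: string; // source component
  requestId: string; // correlates with traces
  workspaceId: string; // multi-tenant attribution
}

interface LogRecord extends LogContext {
  level: LogLevel;
  event: string;
  timestamp: string; // ISO-8601 UTC
  [field: string]: unknown;
}

// Assumed policy: redact by exact key name. Real systems often need more.
const REDACTED_KEYS = new Set(["inviteeEmail", "email", "password", "token"]);

function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = REDACTED_KEYS.has(key) ? "[redacted]" : value;
  }
  return out;
}

function buildLogRecord(
  level: LogLevel,
  event: string,
  ctx: LogContext,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): LogRecord {
  // Caller fields are spread first so they cannot overwrite required fields.
  return {
    ...redact(fields),
    ...ctx,
    level,
    event,
    timestamp: now.toISOString(), // toISOString always emits UTC
  };
}
```

A call site would pass the record through `JSON.stringify` before writing it to the log sink. Making `ctx` a required parameter is the design choice that matters here: a missing `requestId` or `workspaceId` becomes a compile error instead of an unqueryable log line.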
**What NOT to log**: - Routine per-row reads. - Inside hot loops. - Anything with raw PII / secrets — redact at the logger. ### Traces — what to trace A trace is a tree of spans for one request, across services / async boundaries. **Span at**: - Every boundary (HTTP / RPC / IPC). - Every external call (DB query, third-party API, message-bus publish). - Significant in-process operations (a long parse, an expensive computation). **Span attributes**: - Service + method name. - Status (ok / error). - Duration. - Request-id propagated across boundaries. - Multi-tenant attribution. Sampling: 100% of error traces; per-tenant sampling of successful traces (e.g. 1%). Critical paths (payment, security) sampled higher. ### Correlation The unifying field is `requestId`. Every signal carries it: - Logs include `requestId`. - Traces use `requestId` as the trace id. - Metric exemplars (when supported) link to a representative trace via requestId. From a single user-reported issue: read the logs by requestId → jump to the trace → see the metric at that time. Five minutes of triage, not an hour. ### SLOs and SLIs **SLI (Service-Level Indicator)**: a measurable thing. "p95 latency of `users.list`". "Error rate of `payments.charge`". **SLO (Service-Level Objective)**: a target. "p95 < 200ms over rolling 30 days". "Error rate < 0.1% over rolling 7 days". **SLA (Service-Level Agreement)**: the contractual version of an SLO with consequences. Usually weaker than internal SLOs (you pad internally). Pick SLIs per **user journey**, not per service. The user does not care that `users-service` is fast if `auth-service` is slow blocking their login. 
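To make the SLI / SLO distinction concrete, here is a small sketch that measures two SLIs (p95 latency and success rate) over a window of request samples and compares them with a journey-level target. The types, the function names and the nearest-rank percentile method are assumptions for illustration; a production system would normally compute these from histograms in the metrics backend.

```ts
// Hypothetical sketch: names and the percentile method are illustrative.
interface RequestSample {
  latencyMs: number;
  ok: boolean;
}

interface JourneySlo {
  journey: string;
  p95LatencyMsBelow: number; // target: p95 strictly below this value
  successRateAbove: number; // target: success rate strictly above this value (0..1)
}

interface SloReport {
  p95LatencyMs: number;
  successRate: number;
  latencyMet: boolean;
  successMet: boolean;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function evaluateSlo(samples: RequestSample[], slo: JourneySlo): SloReport {
  const p95LatencyMs = percentile(samples.map((s) => s.latencyMs), 95);
  const successRate = samples.filter((s) => s.ok).length / samples.length;
  return {
    p95LatencyMs,
    successRate,
    latencyMet: p95LatencyMs < slo.p95LatencyMsBelow,
    successMet: successRate > slo.successRateAbove,
  };
}
```

The samples passed in should cover the whole user journey (for login, every hop the user waits on), not a single service, so that a fast `users-service` cannot mask a slow `auth-service`.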
Example SLO catalogue: | Journey | SLI | SLO | |---|---|---| | Login | p95 end-to-end latency | < 1s over 30 days | | Login | success rate | > 99.9% over 7 days | | Run flow | p95 dispatch latency | < 500ms over 30 days | | Run flow | success rate (excluding user errors) | > 99.5% over 7 days | | Page load (dashboard) | p95 TTFB | < 800ms over 30 days | ### Error budget For each SLO, the **error budget** is what is allowed to fail. 99.9% / 30 days = 43 minutes of badness allowed. When the error budget burn rate is high (burning a month's budget in a day), alert. When the budget is exhausted, freeze risky changes (feature rollouts, infra migrations) until budget recovers. Error budget is the framework for negotiating reliability vs feature velocity: - Budget intact → ship features fast. - Budget low → focus on reliability. ### Alerting Alert on **user-impacting** failures, not on every anomaly: - High burn rate on an SLO (you'll exhaust the budget within hours). - Cross-cutting saturation (CPU 95% on every node). - Specific catastrophic events (audit ledger verification failed, vault unreachable, region down). Anti-alerts (avoid): - "Error count > 5 in 1 minute" — noise, churn. - Every individual ERROR log line. - Every transient latency spike. Alerts should wake someone. If they would not be actionable at 3 AM, they should not page. ### Dashboards Per service: - RED metrics. - USE metrics. - Top business events (counts per minute). - SLO burn-rate. Per team: - The SLOs they own. - Recent incident burndown. - Top error sources (by count, by user impact). Per tenant (for support): - Their request rate, error rate, p95 latency. - Their quota usage. ### Cost Observability is expensive. Discipline: - **Metrics**: low cardinality; aggregate at source where possible. - **Logs**: structured + sampled; retention tiered (full for 7 days, sampled for 90, cold for 1 year). - **Traces**: tail-sampled (keep error traces in full; sample success). 
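The tail-sampling rule for traces can be sketched as a keep-or-drop decision made once a trace has finished. This is an illustration under stated assumptions, not a collector configuration: `FinishedTrace`, `SamplingPolicy`, `keepTrace` and the FNV-1a hash are hypothetical choices. Hashing the trace id makes the decision deterministic, so every service that sees the same trace reaches the same verdict.

```ts
// Hypothetical sketch: types and names are illustrative.
interface FinishedTrace {
  traceId: string;
  hasError: boolean;
  critical: boolean; // e.g. payment or security paths
}

interface SamplingPolicy {
  successRate: number; // fraction of ordinary successful traces to keep, 0..1
  criticalRate: number; // fraction of successful critical-path traces to keep, 0..1
}

// 32-bit FNV-1a over UTF-16 code units; enough to spread ids evenly for sampling.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

function keepTrace(trace: FinishedTrace, policy: SamplingPolicy): boolean {
  // Error traces are always kept in full.
  if (trace.hasError) return true;
  const rate = trace.critical ? policy.criticalRate : policy.successRate;
  // Map the hash into [0, 1) and keep the trace if it falls under the rate.
  return fnv1a32(trace.traceId) / 0x100000000 < rate;
}
```

With a policy such as `{ successRate: 0.01, criticalRate: 0.25 }` (example values only), storage cost scales with the error volume plus a small, predictable slice of successful traffic.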
Forecast your observability bill alongside your infra bill. Surprise observability costs are common. ### Multi-tenant attribution (mandatory) Every signal in a multi-tenant system carries the tenant id. Support runs queries scoped to one tenant. Cost attribution per tenant flows from this. If you cannot attribute a metric / log / trace to a tenant, you cannot: - Help that specific customer. - Bill that customer (cost-based pricing). - Detect noisy-neighbor effects. - Honour DSAR (delete that tenant's logs). ### Common failure modes - **High-cardinality metric label.** Time-series DB blows up. → User id in logs/traces, not metric labels. - **Free-form log messages.** Cannot query. → Structured logs. - **Alerts on every error.** Pager fatigue; real alert ignored. → Alert on burn rate / impact. - **No trace correlation.** Request fails; logs are scattered; no causality. → `requestId` everywhere. - **No SLOs.** "Is the system OK?" answered by feel. → Define + dashboard + alert. - **Tenant attribution missing.** Cannot help a specific customer. → Mandatory tag on every signal. - **Observability stack costs as much as the product.** Sampling + retention not tuned. → Tiered retention; sampling. ### Tooling stack (typical) | Concern | Tool | |---|---| | Metrics | Prometheus, Datadog, Cloudwatch, Grafana Cloud | | Logs | Loki, Elastic, Datadog, Cloudwatch Logs | | Traces | OpenTelemetry + Jaeger / Tempo / Datadog APM | | Dashboards | Grafana, Datadog | | Alerting | Alertmanager, PagerDuty, Opsgenie | | Errors / exceptions | Sentry, Rollbar | | RUM (real-user monitoring) | Datadog RUM, Sentry, NewRelic | OpenTelemetry as the **instrumentation standard** lets you swap backends. ### See also - [`universal.md`](./universal.md) — gates produce actionable signals; observability extends the principle to runtime. - [`performance-budgets-pattern.md`](./performance-budgets-pattern.md) — perf budgets are derived from observability data. 
- [`chaos-engineering-pattern.md`](./chaos-engineering-pattern.md) — observability is a precondition. - [`../security/audit-ledger-pattern.md`](../security/audit-ledger-pattern.md) — distinct from observability (compliance vs operations). ==== https://playbook.agentskit.io/docs/pillars/quality/performance-budgets-pattern --- title: 'Performance Budgets Pattern' description: 'How to keep ''it''s fast enough'' from drifting into ''why is it slow?''' --- # Performance Budgets Pattern How to keep "it's fast enough" from drifting into "why is it slow?" ## TL;DR (human) Performance is a budget, not an afterthought. Three classes of budget: **bundle** (bytes shipped), **latency** (p50 / p95 / p99 per surface), **resource** (queries, allocations, cache hits). Each has a target + a regression gate. Performance work happens on the SLOs that move the needle, not on micro-optimisations that flatter benchmarks. ## For agents ### Three budget classes | Class | Examples | Where measured | |---|---|---| | **Bundle size** | JS bundle per route, total page weight, image weight | Build time | | **Latency** | p50 / p95 / p99 for HTTP, RPC, DB queries | Production (per [`observability-pattern.md`](./observability-pattern.md)) | | **Resource** | Queries per request, allocations per request, cache hit rate | Production + load tests | Each has a target. Each has a regression detector. ### Bundle size budgets For web apps, per-route budgets: - **JS** (gzipped): home page ≤ 100 KB; authenticated app shell ≤ 300 KB; per-route lazy chunks ≤ 50 KB. - **CSS** (gzipped): per page ≤ 30 KB. - **Images**: hero images ≤ 100 KB; thumbnails ≤ 10 KB; consider WebP / AVIF. - **Fonts**: subset; max 2 weights × 1 family; preload critical. - **Total page weight**: ≤ 1 MB above the fold. Gates: - Per-route bundle size measured at build (`size-limit`, `bundlewatch`, framework-native budgets). - CI fails if any route exceeds budget. - Shrink-only baseline for established codebases. 
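The gates above can be illustrated with a small check that a CI step might run after the build. It is a sketch under assumptions: the function and type names are hypothetical, sizes are gzipped bytes with 1 KB taken as 1024 bytes, and "shrink-only baseline" is interpreted as "a route already over budget is held at its recorded size and may not grow". Tools such as `size-limit` or `bundlewatch` provide their own configuration for this; the sketch only shows the decision logic.

```ts
// Hypothetical sketch: names and the baseline interpretation are assumptions.
interface RouteSize {
  route: string;
  gzipBytes: number;
}

interface BudgetViolation {
  route: string;
  gzipBytes: number;
  limitBytes: number;
  reason: "over-budget" | "grew-past-baseline";
}

const KB = 1024; // assumed: 1 KB = 1024 bytes

function checkBundleBudgets(
  measured: RouteSize[],
  budgets: Readonly<Record<string, number>>,
  baseline: Readonly<Record<string, number>> = {},
): BudgetViolation[] {
  const violations: BudgetViolation[] = [];
  for (const { route, gzipBytes } of measured) {
    const budget: number | undefined = budgets[route];
    if (budget === undefined) continue; // unbudgeted routes are skipped in this sketch
    const recorded: number | undefined = baseline[route];
    // A route recorded above its budget is grandfathered at that size.
    const grandfathered = recorded !== undefined && recorded > budget;
    const limitBytes = grandfathered ? recorded : budget;
    if (gzipBytes > limitBytes) {
      violations.push({
        route,
        gzipBytes,
        limitBytes,
        reason: grandfathered ? "grew-past-baseline" : "over-budget",
      });
    }
  }
  return violations;
}
```

A CI wrapper would exit non-zero when the returned array is non-empty, and would rewrite the baseline file only when a grandfathered route gets smaller.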
Recipe: route-level code splitting; dynamic imports for non-critical features; tree-shaking; dead-code elimination. ### Latency budgets Per user-facing surface: | Surface | p95 latency budget | |---|---| | Auth (login flow) | < 500 ms | | Dashboard initial load | < 1 s | | Standard list-fetch | < 300 ms | | Write (form submit) | < 500 ms | | Search (interactive) | < 200 ms | | Background dispatch (start a job) | < 1 s | | Long-running job (the dispatch, not the work) | < 200 ms | Budgets vary by product; calibrate based on user research + competitor benchmarks. Per-tier breakdown (the budget allocated across layers): ``` Total p95 1000ms budget ├── DNS + TLS + connection: 100 ms (CDN / edge) ├── Server processing: 400 ms (handler + queries) ├── Response payload + transit:200 ms (size + network) └── Browser parse + render: 300 ms (HTML, JS, CSS, paint) ``` Budgets at each layer compose. Blowing the budget at one layer requires shrinking another. ### Resource budgets **Per request**: - DB queries: ≤ 10 per request (N+1 detection: > 20 queries = probable N+1). - DB query time: ≤ 100 ms aggregate per request. - Cache hit rate (when caching is in play): > 80% in steady state. - Allocations: track if memory pressure; flag specific endpoints with high allocation rate. **Per worker / job**: - Memory: < 75% of provisioned limit in steady state (room for spikes). - CPU: < 70% in steady state. ### Where the budget enforcement lives **Build time**: bundle-size gate. Hard fail. **CI integration test**: query-count gate per endpoint. Synthetic load tests on staging produce p95 measurements. Fail PR if a measured endpoint regressed > 10%. **Production observability**: SLO burn rate on latency budgets (per `observability-pattern.md`). Alert when budget burns faster than expected. 
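The CI-level checks could look roughly like the sketch below, which compares one endpoint's measurements with a stored baseline. All names are hypothetical; the thresholds default to the figures used in this pattern (10% p95 regression, 10 queries per request, more than 20 queries as a probable N+1) and are meant to be tuned per product.

```ts
// Hypothetical sketch: names are illustrative; thresholds mirror the budgets above.
interface EndpointMeasurement {
  endpoint: string;
  p95Ms: number; // from a synthetic load test on staging
  queryCount: number; // DB queries observed for one request
}

interface GateOptions {
  maxRegression: number; // allowed fractional p95 increase over baseline
  queryBudget: number; // queries allowed per request
  nPlusOneThreshold: number; // above this, assume an N+1 pattern
}

type FindingKind = "latency-regression" | "query-budget" | "probable-n-plus-one";

interface Finding {
  endpoint: string;
  kind: FindingKind;
}

const DEFAULT_GATE: GateOptions = { maxRegression: 0.1, queryBudget: 10, nPlusOneThreshold: 20 };

function checkEndpoint(
  current: EndpointMeasurement,
  baselineP95Ms: number,
  opts: GateOptions = DEFAULT_GATE,
): Finding[] {
  const findings: Finding[] = [];
  // Fail only when p95 regressed by more than the allowed fraction.
  if (current.p95Ms > baselineP95Ms * (1 + opts.maxRegression)) {
    findings.push({ endpoint: current.endpoint, kind: "latency-regression" });
  }
  // Report the more specific N+1 finding instead of a plain budget breach.
  if (current.queryCount > opts.nPlusOneThreshold) {
    findings.push({ endpoint: current.endpoint, kind: "probable-n-plus-one" });
  } else if (current.queryCount > opts.queryBudget) {
    findings.push({ endpoint: current.endpoint, kind: "query-budget" });
  }
  return findings;
}
```

Any non-empty result would fail the PR. Load-test p95 values are noisy, so a real gate would likely compare medians of several runs rather than a single measurement.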
### Anti-patterns to detect | Pattern | Signal | |---|---| | N+1 query | Per-request query count linearly tied to result count | | Synchronous fanout to N services | p95 increases with N | | Hot loop with allocation | GC pressure spikes per request | | Unbounded result set | Latency increases over time as data grows | | Missing index | DB CPU climbs; specific query slow | | Synchronous external call | Tail latency dominated by third-party | | Render-blocking JS | First Contentful Paint > 2s | | Large image not lazy-loaded | Above-fold image stalls render | Each has a recipe to fix; agents can match symptom to recipe. ### Performance work prioritisation Not all performance issues are worth fixing. Prioritise by: 1. **User impact**: how many users hit it; how often? 2. **Budget burn**: is the SLO at risk? 3. **Cost**: is the slow path also expensive (queries, compute)? Anti-prioritisation: optimising a path that runs once per week to save 5 ms is noise. Optimising the dashboard load that every user hits 100×/day is high-value. ### Load testing Synthetic load tests, run periodically: - **Soak tests**: steady load for hours; surfaces memory leaks, connection-pool exhaustion, cache eviction. - **Spike tests**: sudden 10× load; surfaces rate-limit gaps, queue-depth blow-ups. - **Ramp tests**: gradual climb; surfaces the point where p95 explodes. Tools: k6, Vegeta, Locust, Gatling. Load tests run against staging with production-like data shapes. CI-integrated tests for critical paths; longer tests pre-release. ### Real-user monitoring (RUM) Production gives the truth synthetic tests cannot: - Per-user p50/p95/p99 latency. - Geographic breakdown. - Device breakdown (mobile / desktop / connection class). - Per-route Core Web Vitals (LCP, INP, CLS). RUM data feeds SLO calculation. Synthetic load tests catch what RUM will reveal; RUM catches what synthetic missed. 
### Performance as a feature Communicating performance to users: - Optimistic UI: render the new state immediately; reconcile after. - Skeleton loading: shows structure within 100ms (per [`../ui-ux/universal.md`](../ui-ux/universal.md) Rule 4). - Streaming results: don't wait for the full payload to render. - Background work + progress: tell the user it is happening; estimate completion. Perceived performance > measured performance. A 2-second operation that feels instant beats a 1-second operation that feels slow. ### Common failure modes - **No budget at all.** "It's fast enough." Until it isn't. → Document budgets; gate regressions. - **Budgets that no one reviews.** Budget creeps; nobody notices. → Budget review at release time. - **Micro-optimisations that don't move the needle.** Optimised a 5 ms function nobody hits. → Measure user-perceived; prioritise by impact. - **Bundle-size gate without per-route detail.** Total goes up by 2 KB; you don't know which route. → Per-route budgets. - **Synthetic-only measurement.** Tests say fast; users say slow. → RUM mandatory; sample real users. - **Performance work that breaks tests.** Speed at expense of correctness. → Performance budget is part of the contract; correctness is not negotiable. ### Tooling stack (typical) | Concern | Tool | |---|---| | Bundle analyzer | webpack-bundle-analyzer, source-map-explorer, `next bundle` | | Bundle gate | size-limit, bundlewatch, framework-native | | Synthetic load | k6, Vegeta, Locust | | RUM | Sentry, Datadog RUM, NewRelic Browser, CrUX | | Profiling (server) | clinic.js, perf, py-spy, async-profiler (Java) | | Profiling (web) | Chrome DevTools Performance, React Profiler | | Core Web Vitals | Lighthouse CI, web-vitals lib | ### See also - [`observability-pattern.md`](./observability-pattern.md) — measurement infrastructure. - [`../architecture/anti-overengineering.md`](../architecture/anti-overengineering.md) — premature optimisation reminder. 
- [`../architecture/distributed-data-pattern.md`](../architecture/distributed-data-pattern.md) — caching tiers; replica routing. - [`chaos-engineering-pattern.md`](./chaos-engineering-pattern.md) — load tests + fault injection. ==== https://playbook.agentskit.io/docs/pillars/quality/pre-push-pattern --- title: 'Pre-push Pattern' description: 'The safety net between local changes and CI.' --- # Pre-push Pattern The safety net between local changes and CI. ## TL;DR (human) Pre-push runs structural gates + typecheck + build — fast enough to be tolerable (target ≤ 30s), thorough enough to catch what pre-commit missed. It does **not** run lint or full tests. The goal is "catch structural drift before CI burns minutes", not "be CI". ## For agents ### What runs | Check | Pre-commit | Pre-push | CI | |---|---|---|---| | File-size (changed files) | ✓ | | ✓ (all files) | | Secrets scan | ✓ | ✓ | ✓ | | Raw-error scan | ✓ | ✓ | ✓ | | All structural gates | | ✓ | ✓ | | Typecheck | | ✓ | ✓ | | Build | | ✓ | ✓ | | ADR / RFC integrity | | ✓ | ✓ | | Lint | | | ✓ | | Unit tests | | | ✓ | | Integration tests | | | ✓ | | E2E | | | ✓ | | Sanity audit | | | ✓ (scheduled) | | Mutation | | | ✓ (periodic) | Why lint and tests are **not** in pre-push: - Lint is slow on a big repo and CI catches it anyway. - Full tests take minutes; nobody waits for them at push time. - Pre-push must stay tolerable or agents will bypass it with `git push --no-verify`. ### Runtime budget - Total pre-push: ≤ 30s on a warm machine. - If you bust the budget: profile; move slow checks to CI. - Gates that grow over time: cache aggressively; scope-narrow to changed files where the gate semantics allow. ### Implementation Husky / lefthook / native git hooks. Hook script: ```bash #!/usr/bin/env bash set -e pnpm check:quality-gates --fast # structural gates, no baseline regeneration pnpm check:adr-rfc # ADR/RFC integrity pnpm typecheck # tsc -b pnpm build # turbo build, cached ``` Set `-e` so a failure stops the push.
The hook exits non-zero on any failure; git refuses the push. ### Concurrent-merge protection Pre-push runs against `HEAD`, not against `origin/main`. If main has moved since you forked, your pre-push may pass while CI fails because your branch is out of date. Defense: 1. Before push: `git fetch && git status -uno` — confirm you're not behind main. 2. If behind: rebase, re-run pre-push. 3. The pre-push hook itself can perform this check and refuse to push when behind — opt-in based on team comfort. ### Bypass policy `git push --no-verify` bypasses the hook. It exists for emergencies. Conventions worth adopting: - If you bypass, the PR description must say why ("hook misfired; verified manually"). - A CI job verifies that bypassed PRs still pass all pre-push checks. Bypass surfaces the failure in CI instead of locally. - Repeated bypass without justification is a process smell; investigate the hook (probably too slow or producing false positives). ### Per-package vs whole-repo In a monorepo, pre-push can be scoped to the packages your branch touches. Faster, but riskier — a structural drift in a peer package might not show until CI. Conservative default: whole-repo gates. Optimize only if you measure them slow. ### Common failure modes - **Hook takes 90 seconds.** Agents bypass. → Profile; move slow checks to CI. - **Hook runs full test suite.** Agents bypass. → Tests are CI, not pre-push. - **Hook depends on dev-only env (e.g. `.env.local`).** Fails on fresh checkouts. → Hooks read from committed config only. - **Hook silently auto-fixes things.** Surprise commits during push. → Hooks check, do not modify. - **Hook output is hundreds of lines.** Agent skims. → Concise output; full report on demand via `--verbose`. ### Hooks that auto-regenerate files A common trap: a pre-commit / pre-push hook that re-runs a code generator (status file, types from schemas) and stages the result. This makes for surprising commit contents and conflicts with rebase. 
Avoid auto-staging. If a generator needs to run, the hook fails and tells the agent to run the generator + amend the commit. Agent control beats hook magic. ### Failure recovery If pre-push fails on a single small drift: 1. Read the message. It is actionable. 2. Fix in the smallest possible diff (often a one-line correction). 3. Amend the commit (`git commit --amend --no-edit`), re-run hook. 4. Push. If pre-push fails on something you cannot fix in 5 minutes: 1. Stash, investigate root cause. 2. Open a ticket if it's a hook bug or pre-existing main red. 3. Bypass with justification, file follow-up. ### See also - [`universal.md`](./universal.md) — Rule 5 (three-tier split). - [`quality-gates-pattern.md`](./quality-gates-pattern.md) — what `check:quality-gates --fast` does. - [`../ai-collaboration/concurrent-agent-pattern.md`](../ai-collaboration/concurrent-agent-pattern.md) — stash-verify-red protocol. ==== https://playbook.agentskit.io/docs/pillars/quality/product-analytics-experimentation-pattern --- title: 'Product Analytics + Experimentation Pattern' description: 'How to measure what users do, run experiments cleanly, and let data drive product decisions — without inviting bias or noise.' --- # Product Analytics + Experimentation Pattern How to measure what users do, run experiments cleanly, and let data drive product decisions — without inviting bias or noise. ## TL;DR (human) Product analytics is distinct from observability (system health). Two surfaces: **event tracking** (what users do) + **experiments** (controlled tests of variants). Discipline: a tracking schema; consent + privacy; pre-registered hypotheses; minimum sample size; honest stop conditions. Avoid HARKing (hypothesizing after results known) and p-hacking. ## For agents ### Analytics vs observability | Dimension | Observability | Product analytics | |---|---|---| | Question | Is the system healthy? | What are users doing? 
| | Audience | Engineers, on-call | Product, growth, leadership | | Signal | Metrics, logs, traces | Events, funnels, cohorts | | Tooling | Prometheus, Datadog, OTel | Mixpanel, Amplitude, PostHog, Segment | | Cardinality | Low (per service) | High (per user, per event) | Different needs, different storage, different teams. Don't conflate. ### Event tracking schema Every tracked event has: ```ts type AnalyticsEvent = { name: string; // "flow.run.started" timestamp: string; userId?: string; // anonymized if consent not granted workspaceId?: string; sessionId: string; properties: Record<string, unknown>; context: { userAgent: string; referrer: string; locale: string; [key: string]: unknown /* further fields */ }; }; ``` Naming convention: three dot-separated segments, as in `flow.run.started`, `checkout.coupon.applied`, `dashboard.tab.opened`. Schema discipline: - Events registered in a central registry (typed; reviewed). - Per-event properties documented. - Renames go through the same deprecation process as APIs. - Avoid the generic `event_type=foo, value=bar` pattern; use specific event names. ### Consent + privacy Per [`../security/data-classification-pattern.md`](../security/data-classification-pattern.md): - Tracking pixels / analytics SDK loaded only after consent (EU) or with opt-out availability (CA). - PII never in event properties (no emails, names, raw IPs). - User ID → hash; reversible only with explicit privilege. - DSAR deletion: walks analytics data too. - Cookie banner respects user choice; doesn't dark-pattern. ### Funnels and cohorts Funnel: ordered sequence of events. Measures drop-off. ``` signup.started → signup.email-entered → signup.verified → signup.completed ``` Cohort: group of users sharing an attribute or behavior. Measures retention. ``` "users who completed signup in week N" — track week-N+1, N+2, ... retention. ``` These are the workhorses. Most product metrics come from one or the other. ### Experiments — A/B + multivariate An experiment varies one or more dimensions across user segments and measures the impact.
**Pre-register**: - Hypothesis: "Reducing form fields from 5 to 3 will increase signup conversion." - Primary metric: signup conversion rate. - Guardrail metrics: completion-of-key-action 7 days later (don't optimise top-of-funnel at the cost of long-term value). - Sample size: how many users per variant are needed to reach the target statistical power. - Duration: how long; minimum 1 week for weekday/weekend effects. - Stop conditions. **Run**: - Assign users to variants deterministically (hash of user-id mod N). - Sticky: same user sees same variant across sessions. - One experiment per metric per surface at a time (parallel experiments confound). - Don't peek + stop early without sequential analysis methods (avoid p-hacking). **Analyze**: - Compute primary + guardrail metrics. - Statistical significance test (frequentist or Bayesian). - Check for sample bias. - Document outcome. **Decide**: - Win → promote. - Loss → kill. - Inconclusive → either more data or kill. ### Sample size + power A common mistake: running underpowered experiments. Required sample size grows with the inverse square of the minimum detectable effect (MDE): halving the MDE roughly quadruples the users needed. For a 0.5-percentage-point absolute conversion lift on a 5% baseline (5.0% → 5.5%), a two-sided test at α = 0.05 typically needs ~30,000 users per variant for 80% power; a full 1-point lift (5% → 6%) needs roughly 8,000. Tools: built-in calculators in analytics platforms; G*Power; in-house pre-flight checks. If you can't get the sample size in a reasonable window: pick a more sensitive metric or larger MDE. 
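Two of the steps above lend themselves to a short sketch: sticky, deterministic variant assignment and a pre-flight sample-size estimate. This is a minimal illustration, not a reference implementation. The names `assignVariant`, `sampleSizePerVariant`, and `fnv1a` are hypothetical, salting the hash with an experiment key is an added assumption, and the default z-scores assume a two-sided α of 0.05 and 80% power under the standard two-proportion normal approximation. A platform's built-in calculator may use a slightly different formula.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units.
// Any stable, well-mixed hash would serve the same purpose.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Deterministic + sticky: the same (experiment, user) pair always maps to
// the same variant, with no stored assignment table.
function assignVariant(
  experimentKey: string, // assumption: salt per experiment so buckets differ across experiments
  userId: string,
  variants: readonly string[],
): string {
  if (variants.length === 0) throw new RangeError("at least one variant is required");
  return variants[fnv1a(`${experimentKey}:${userId}`) % variants.length];
}

// Approximate users needed per variant for a two-sided two-proportion z-test.
function sampleSizePerVariant(
  baselineRate: number, // e.g. 0.05 for a 5% conversion rate
  absoluteLift: number, // MDE as an absolute difference, e.g. 0.01
  zAlpha = 1.959964, // assumed default: two-sided alpha = 0.05
  zBeta = 0.841621, // assumed default: 80% power
): number {
  const p1 = baselineRate;
  const p2 = baselineRate + absoluteLift;
  if (p1 <= 0 || p1 >= 1 || p2 <= 0 || p2 >= 1 || absoluteLift === 0) {
    throw new RangeError("rates must lie strictly between 0 and 1, with a non-zero lift");
  }
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / absoluteLift ** 2);
}

const variants = ["control", "short-form"] as const;
const first = assignVariant("signup-form-fields", "user-123", variants);
const second = assignVariant("signup-form-fields", "user-123", variants);

const onePoint = sampleSizePerVariant(0.05, 0.01);
const halfPoint = sampleSizePerVariant(0.05, 0.005);
console.log({ first, second, onePoint, halfPoint });
```

Because the estimate scales with the inverse square of the lift, shrinking the MDE is expensive; that is the arithmetic behind choosing a larger MDE or a more sensitive metric when the traffic isn't there.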
### Statistical anti-patterns | Pattern | Why wrong | Fix | |---|---|---| | **Peeking + stop early** | Inflates false positive rate | Sequential / Bayesian methods; or commit to duration | | **HARKing** (hypothesise after seeing results) | Confirms noise | Pre-register hypothesis | | **p-hacking** (run many metrics; one happens to be significant) | Multiple comparisons; false discovery | Limit metrics; correct for multiple tests | | **Subgroup hunting** | Same as p-hacking | Pre-registered subgroups only | | **Ignoring guardrail metrics** | Optimise locally, lose globally | Always include long-term + counter-balance | | **Stopping at "no significant difference"** | Absence of evidence isn't evidence of absence | Power analysis; effect-size bounds | ### Holdout groups Beyond per-experiment, maintain holdout cohorts: - 1-5% of users never see any experiments. - Lets you measure cumulative impact of all changes. - Catches regressions invisible at experiment-by-experiment level. ### Experiment retirement Like feature flags (per [`../architecture/feature-flags-pattern.md`](../architecture/feature-flags-pattern.md)): - Mandatory `retireAt`. - Decision recorded + variant cleaned up. - Loser variant code deleted. ### North-star metric One metric that captures product success: - DAU / MAU for social products. - ARR for subscription products. - Net retention for SaaS. - Hours of value delivered (custom). Sub-metrics ladder up. Engineering OKRs tie to north-star. ### Privacy + analytics — staying clean - Use anonymized identifiers per user (rotating, salted). - Aggregate before storing where possible (counts, not individuals). - Server-side tracking preferred over client (less ad-block; more reliable). - Self-hosted analytics if data must stay in-house. - DSAR-able: analytics rows tagged with user-id-hash; deletion walks them. ### Common failure modes - **Inconsistent event naming**. Half use snake_case, half camelCase; analyses break. → Registry. 
- **PII in event properties**. Email as user-id. → Schema review. - **Experiments without pre-registration**. HARK + p-hack rampant. → Pre-register; tool enforces. - **No holdout group**. Local wins; global loss invisible. → Holdouts. - **Tracking after consent withdrawn**. GDPR violation. → SDK respects consent state. - **Experiment retired by removing the loser inline + leaving flag**. Stale flag. → Full retirement. - **Vanity metrics**. Page views; bounce. Not tied to value. → Focus on activation, retention, revenue. ### Tooling stack (typical) | Concern | Tool | |---|---| | Event tracking | Mixpanel, Amplitude, PostHog, Heap, Segment | | Server-side | Segment, Snowplow, RudderStack, in-house | | Experimentation | Statsig, GrowthBook, Optimizely, LaunchDarkly Experiments, in-house | | Data warehouse | Snowflake, BigQuery, Redshift, ClickHouse | | BI | Looker, Tableau, Metabase, Mode | | Funnel + cohort | Built into Mixpanel/Amplitude; or warehouse-native | ### Adoption path 1. **Day 0**: a small event schema (10-20 events covering signup + activation + core actions); consent banner; SDK. 2. **Month 1**: funnels for primary user journeys. 3. **Month 2**: cohort retention dashboards. 4. **Month 3**: first experiment; statistically rigorous. 5. **Month 6**: holdout group; experiment platform. 6. **Year 1+**: north-star + sub-metrics ladder; experimentation as product practice. ### See also - [`observability-pattern.md`](./observability-pattern.md) — observability, distinct from analytics. - [`../architecture/feature-flags-pattern.md`](../architecture/feature-flags-pattern.md) — experiments as flags. - [`../security/data-classification-pattern.md`](../security/data-classification-pattern.md) — privacy classifications. - [`../security/compliance-framework-pattern.md`](../security/compliance-framework-pattern.md) — GDPR / CCPA implications. 
==== https://playbook.agentskit.io/docs/pillars/quality/quality-gates-pattern --- title: 'Quality Gates Pattern' description: 'How to bundle structural rules into one fast command an agent runs before every push.' --- # Quality Gates Pattern How to bundle structural rules into one fast command an agent runs before every push. ## TL;DR (human) `pnpm check:quality-gates` (or your equivalent). Orchestrates the structural gates — file size, no `any`, named exports, raw-error scan, tokens, intl, secrets, completeness. Runs in parallel. Each gate is atomic and produces an actionable failure. Total runtime: target < 30s on a warm cache. ## For agents ### The gate set A complete gate suite covers six concerns: | Gate | Pillar | Enforces | |---|---|---| | file-size | architecture / quality | Per-extension line budget, shrink-only baseline | | no-any | architecture | No `any` outside escape-hatched comments | | named-exports | architecture | No `export default` outside framework-mandated files | | raw-error | architecture | No `throw new Error(...)` in boundary files | | tokens | ui-ux | No hex/rgb/hsl/oklch literals, no inline color styles | | native-html | ui-ux | No bare `\`/`\`/etc. in shipped surfaces | | intl | ui-ux | No hardcoded user-visible strings | | secrets | security | No high-entropy strings / PEM blocks / key prefixes | | completeness | quality | No `TODO`/`FIXME`/`throw new Error('not implemented')`/`disabled:true` in shipped surfaces | | pr-intent | governance | Manifest matches diff (CI only; not pre-commit) | | adr-numbering | architecture | ADR sequence integrity (CI only) | | rfc-index | architecture | RFC promotion linkage (CI only) | ### Orchestrator script A single entry: `pnpm check:quality-gates`. It: 1. Discovers the configured gates from a single config file (e.g. `.quality-gates.json`). 2. Runs them in parallel where possible. 3. Aggregates failures into one report with per-gate sections. 4. 
Exits 0 if all pass; non-zero with summary if any fail. 5. Has flags: `--gate=<name>` to run just one; `--explain` for fix recipes; `--baseline` to regenerate baselines. Reference impl shape: [`../../scripts/README.md`](../../scripts/README.md). ### Parallelism Most gates are CPU-bound and independent. Run them in parallel; on a typical dev machine, the full suite finishes in about 30s instead of roughly 5 minutes when run serially. The exceptions — gates that need a build output (e.g. bundle-size on `core` package) — depend on the build. Sequence: build first, then dependent gates. ### Configuration One config file at repo root. Schema: ```json { "gates": { "file-size": { "enabled": true, "budgets": { ".tsx": 300, ".ts": 500, ".test.ts": 800 }, "baseline": ".file-size-baseline.json" }, "no-any": { "enabled": true, "allowMatchRegex": "// allow-any:" }, "named-exports": { "enabled": true, "exempt": [ "apps/web/app/**/{page,layout,loading,error}.tsx", "**/{tailwind,next,vitest}.config.*" ] }, "raw-error": { "enabled": true, "boundaryPaths": ["packages/*/src/methods/**", "packages/*/src/handlers/**"] } } } ``` Why one file: - Agents see all gate configs in one place. - Reviewers see config changes in one diff. - Disabling a gate is visible — no scattered overrides. ### Adding a new gate 1. Implement: stand-alone script in `scripts/check-<name>.mjs`. Exits 0/non-zero. Reads its config from `.quality-gates.json`. 2. Action message: when it fails, print file:line + rule + fix recipe. 3. Baseline (if applicable): generate baseline on first run; lock to shrink-only. 4. Register: add to `.quality-gates.json`. 5. Pre-commit: add to the hook if runtime < 1s. 6. Document: one row in [`../../scripts/README.md`](../../scripts/README.md). ### Disabling a gate A failing gate is the gate working. If you need to disable it: - File-level: code comment escape hatch (`// allow-any:`, `// allow-native:`). Counted by a separate gate that fails if the count grows. - Repo-level: `enabled: false` in `.quality-gates.json`. 
This is a serious change — requires an ADR. Never `eslint-disable-next-line` for structural-gate rules. Use the named escape hatch so the count is tracked. ### Local vs CI parity Same gate, same config, same result locally as in CI. Achievable by: - Pinning the Node / package-manager version (Volta / `.nvmrc` / `packageManager`). - Running gates from the same `pnpm` script. - Avoiding env-dependent behavior in gate scripts. If local says green and CI says red, you have a parity bug. Fix the parity bug, not the gate. ### Performance budget - Whole suite: < 30s on a warm dev machine. - Each gate: < 5s individually. - A gate that gets slow over time: profile it. Often it's reading too many files; cache or scope-narrow. Agents tolerate fast gates and skip slow ones. Keep them fast. ### Common failure modes - **Gate output is just "147 errors found".** Agent disables the gate. → Per-error file:line + rule + fix. - **Composite gate enforcing 4 rules at once.** One rule fires; agent can't tell which. → One gate = one rule. - **Gate config scattered across 6 files.** Disable one rule requires hunting. → One config file. - **Pre-commit takes 15 seconds.** Agents bypass with `--no-verify`. → Move slow gates to pre-push or CI; keep pre-commit fast. - **Gates pass locally, fail in CI.** Parity bug. → Pin versions; run gates from the same script. ### See also - [`universal.md`](./universal.md) — Rule 1 (actionable), Rule 8 (one gate one rule). - [`pre-push-pattern.md`](./pre-push-pattern.md) — where heavier gates run. - [`../../scripts/README.md`](../../scripts/README.md) — gate reference impls. ==== https://playbook.agentskit.io/docs/pillars/quality/sanity-pattern --- title: 'Sanity Pattern' description: 'The cross-cutting audit that catches what individual gates miss.' --- # Sanity Pattern The cross-cutting audit that catches what individual gates miss. 
## TL;DR (human) A periodic batch of checks that span concerns — coverage AND honesty AND completeness AND drift — produces one report. CI fails if the report regresses against the last release. Run it on a schedule (nightly), on demand (`pnpm sanity`), and before every release. ## For agents ### Why sanity is separate from gates Gates are atomic, fast, blocking. Each enforces one rule. They run on every commit. Sanity is composite, slower, periodic. It asks questions a single gate cannot: - "Does this package have 95% coverage AND every error code asserted somewhere?" - "Is every screen in the nav covered by at least one E2E test?" - "Does every ADR have a corresponding code surface, or is it tombstoned?" - "Does the for-agents doc for each package mention all its exported methods?" - "Are there any RPC methods registered but not handler-bound?" Each is a *cohesion* check. None of them is a single-rule violation; all of them indicate quiet drift. ### Output The sanity audit produces a single report — `docs/audit/sanity-report.md` or equivalent. Sections: 1. **Per-package coverage cohesion** — coverage % + code-asserted % + uncovered branches. 2. **Per-screen completeness** — E2E ref count + a11y status + intl coverage. 3. **Doc-vs-code drift** — for-agents files that reference removed symbols; symbols with no doc. 4. **Contract-vs-handler completeness** — methods in the registry without a bound handler; handlers without a registry entry. 5. **ADR-vs-code surface** — accepted ADRs whose described surface no longer exists. 6. **Baseline trend** — file-size baseline shrinking? Growing? Stagnant? 7. **Honesty smells** — `TODO`/`FIXME` count, `disabled:true` count, stub-returning methods. ### Cadence - **On demand**: `pnpm sanity` (or equivalent). Any agent / human can run. - **Scheduled**: nightly CI job. Posts the report to a known location. - **Pre-release**: required. The release-gate checklist verifies the report has no regressions. 
### Regression detection Each section produces a numeric metric. The report compares to the previous run; CI fails if any metric regressed beyond a threshold. Example metrics: - "Methods registered without handlers": should be 0; if > 0, fail. - "Stub-returning methods in shipped surfaces": baseline N; fail if > N. - "Screens without E2E": baseline M; fail if > M. Like file-size budgets, sanity uses shrink-only baselines. Existing drift is grandfathered; new drift fails. ### Per-pillar sections Each pillar can contribute a section: | Pillar | Sanity contribution | |---|---| | architecture | ADR-vs-surface, contract-vs-handler, package-vs-routing-table drift | | security | un-audit-logged privileged ops, secrets in source, RBAC roles without capabilities | | ui-ux | screens without empty state, intl coverage per locale, token drift | | quality | per-package coverage trend, mutation score trend, flake rate | | governance | tombstoned docs still referenced as active, ADRs with no Status | | ai-collaboration | memories referencing removed paths, slash commands without bodies | ### Implementation shape ``` scripts/ ├── sanity/ │ ├── architecture.mjs # produces sections 4, 5 │ ├── security.mjs # produces honesty smells + RBAC checks │ ├── ui-ux.mjs # screens, intl, tokens │ ├── quality.mjs # coverage cohesion, baseline trends │ ├── governance.mjs # tombstone & ADR cohesion │ └── ai-collab.mjs # memory & slash cohesion └── sanity.mjs # orchestrator, runs in parallel, aggregates report ``` ### Report consumption The report is markdown. It is committed to the repo (or posted as a CI artefact). Readers: - Developers, before opening a PR: are there any easy wins to address? - Reviewers: does the PR fix or worsen drift? - Release managers: is the report clean enough to ship? ### Common failure modes - **Sanity audit nobody reads.** Drift accumulates silently. → Make CI fail on regressions; force a read. - **Sanity audit blocks every PR.** Friction; agents bypass. 
→ Sanity is periodic + pre-release, not per-PR. Per-PR gates are the structural gates. - **One huge report with no per-pillar split.** Cannot triage. → Section per pillar; section per concern. - **Metrics that always say "0" or always say "stable".** Useless signal. → Calibrate; pick metrics that move. - **No baseline; first run fails on accumulated debt.** → Capture baseline on first run; gate to shrink-only. ### See also - [`universal.md`](./universal.md) — Rule 9 (sanity cross-cuts). - [`quality-gates-pattern.md`](./quality-gates-pattern.md) — per-PR enforcement. - [`../../scripts/README.md`](../../scripts/README.md) — gate + sanity reference impls. ==== https://playbook.agentskit.io/docs/pillars/quality/test-pyramid --- title: 'Test Pyramid' description: 'How to mix test types so the cheap ones catch most bugs and the expensive ones cover what only they can.' --- # Test Pyramid How to mix test types so the cheap ones catch most bugs and the expensive ones cover what only they can. ## TL;DR (human) Five tiers, ordered by cost. Spend most of your budget on tiers 1–2. Reserve tier 5 for golden paths and cross-process boundaries. The pyramid is not dogma — it is **cost optimization for catch rate**. ## For agents ### Tiers | Tier | Type | Runtime | What it catches | |---|---|---|---| | 1 | Schema parse / contract | <1 ms each | Wrong shapes, missing fields, type-vs-runtime drift | | 2 | Unit (pure functions, single class) | <10 ms each | Logic errors, off-by-one, edge cases | | 3 | Integration (handler + store + adapter, in-process) | <500 ms each | Boundary mismatches, transaction-ordering, missing wiring | | 4 | Visual regression / a11y | seconds each | Token drift, component layout regressions, a11y violations | | 5 | E2E (real app, real services) | minutes each | Cross-process bugs, real-world flow integrity | ### Where to spend Rough budget for a healthy codebase: - 70% of test count: tier 1–2. - 25%: tier 3. - 4%: tier 4. - 1%: tier 5. 
If your suite inverts this — 60% E2E, 10% unit — your runtime is long, your signal is flaky, and your debugging surface is huge. ### Which tier catches which bug When a bug is reported, pick the *lowest* tier that can pin it: 1. Could a schema parse test reject the bad input? → Add tier 1 test. 2. Could a unit test fail on the wrong logic? → Add tier 2 test. 3. Does the bug appear only when handler + store interact? → Tier 3. 4. Does the bug show only in the rendered DOM? → Tier 4. 5. Does the bug live in cross-process handoff or browser-only behavior? → Tier 5. Always escalate to the higher tier only after the lower tier cannot pin it. ### Test names A test name reads like a sentence: ``` describe("users.list handler", () => { it("rejects missing workspaceId with VALIDATION_ERROR", ...) it("returns empty rows when no users in workspace", ...) it("respects limit and cursor for pagination", ...) }) ``` Anti-pattern: `it("works")`, `it("test 1")`. Agent-produced tests with these names are a smell — they tested the wrong thing. ### Determinism Every test runs in isolation, in any order, in parallel, with no shared state. - No file system writes outside a per-test temp dir. - No network calls (mock the boundary). - No timer / clock drift (inject the clock). - No global module state. If a test passes in isolation and fails in parallel, the test has hidden global state. Fix the test, not the order. ### Fixtures Fixtures are data, not code. Keep them in `__fixtures__/` directories next to the tests that use them. One fixture per file; descriptive name. When a fixture grows past ~50 lines, ask whether the underlying schema is too lenient. Fixtures that need to encode many edge cases hint at a schema that should reject the edge cases at parse time. ### Coverage interpretation Coverage tells you which lines ran, not whether the tests are good. A 100% coverage suite that never asserts on outputs is worthless. 
Use coverage to find untested *branches*, then ask: "is the untested branch reachable in production?" If yes, add a test. If no, the branch is dead code; delete it. ### Property-based testing For pure functions with a clear input domain (parsers, serializers, math), add a few property-based tests. They catch edge cases unit tests miss. ``` property("any valid input round-trips through parse + serialize", ...) ``` One property test can replace fifty unit tests. Reserve for high-value boundaries. ### Mutation as a coverage backstop After unit suite stabilises, mutation testing scores its real catch rate. See [`mutation-testing-pattern.md`](./mutation-testing-pattern.md). ### Common failure modes - **Tests that only assert on rendered text.** Break on intl / copy changes. → Assert on structure or codes. - **Tests that mock too deep.** End up testing the mocks. → Mock at the trust boundary only. - **Tests that share fixtures via mutation.** Order-dependent. → Fresh fixtures per test, or immutable fixtures. - **E2E flake "fixed" by a sleep.** Flake hidden, not fixed. → Find the deterministic signal; assert on that. `expect.poll()` / `waitFor()` over fixed sleeps. - **Coverage 95% but error codes never asserted.** The error path is untested. → A separate gate scans tests for `code:` assertions; flags codes that are never asserted. ### See also - [`universal.md`](./universal.md) — Rule 3 (hermetic before E2E), Rule 4 (assert on codes). - [`mutation-testing-pattern.md`](./mutation-testing-pattern.md) - [`../architecture/contracts-zod-pattern.md`](../architecture/contracts-zod-pattern.md) — tier 1 lives here. ==== https://playbook.agentskit.io/docs/pillars/quality/universal --- title: 'Quality — Universal Principles' description: 'How to know agent-produced code works without manually reading every diff.' --- # Quality — Universal Principles How to know agent-produced code works without manually reading every diff. ## TL;DR (human) Nine rules. 
The goal is **trust the green signal** — when the gates say green, you can ship without re-reading the diff. Agents will produce more code than you can review; the gates are the only thing that scales. 1. Gates produce actionable messages, not boolean failures. 2. Per-package coverage targets, not aggregate. 3. Hermetic tests before E2E. 4. Tests assert on codes, not messages. 5. Pre-commit fast, pre-push thorough, CI complete. 6. Shrink-only baselines for legacy debt. 7. Verify-first before "fixing" a flaky / red signal. 8. One gate = one rule = one fix recipe. 9. Sanity audit cross-cuts what individual gates miss. ## For agents ### Rule 1 — Gates produce actionable messages A failing gate must answer: - Which file, which line. - Which rule was broken. - What the fix looks like. - The escape hatch, if one exists. "Lint failed (147 errors)" is not actionable. "src/api/users.ts:42 — boundary file may not throw raw `Error`; use a typed `AppError` subclass with a stable code. Escape hatch: `// allow-raw-error: \`." is actionable. Agents act on actionable signals. Agents disable or bypass non-actionable ones. **Failure mode prevented:** agents adding broad `// eslint-disable` comments because they could not figure out which specific rule fired. ### Rule 2 — Per-package coverage targets, not aggregate Aggregate coverage hides which packages are well-tested and which are not. Set a coverage threshold per package: - Foundation / contract packages: 95%+ (small surface, high blast radius if broken). - Logic packages: 85%+. - UI / integration packages: 70%+ (some surfaces resist unit testing). CI fails if any package drops below its target. Aggregate is computed for reporting, not for gating. **Failure mode prevented:** a 90% aggregate that hides one package at 30%; agents adding to the strong package because that is where green commits are easy. ### Rule 3 — Hermetic tests before E2E Reproduce bugs in component-level / in-process tests first. E2E only for golden paths. 
- A failing in-process test takes seconds to run and stays deterministic. - A failing E2E test takes minutes, flakes, and gives you no signal about *where* the failure is. When triaging a bug: 1. Try to reproduce in a unit test against the suspect module. 2. If that's not enough, an integration test wiring stores + handlers in-process. 3. E2E only if the bug is genuinely cross-process (sidecar handoff, network boundary). **Failure mode prevented:** "flaky E2E test" turning into a 4-hour debugging session for something a 30-second component test would have pinned exactly. ### Rule 4 — Tests assert on codes, not messages For typed errors: ``` expect(err.code).toBe("AUTH_REQUIRED"); // ✓ expect(err.message).toContain("Auth"); // ✗ ``` Messages get intl-resolved, get reworded for clarity, drift over releases. Codes are stable contracts. For non-error assertions: prefer asserting structural shape over rendered text where intl is involved. Asserting ` // ✓ right ``` The icon is decorative; the button has a name. #### Form errors ```tsx // ✓ right {hasError && ( {t("form.email.error.required")} )} ``` Required: explicit; aria-invalid: state; aria-describedby: link to the message; role="alert": announces on appearance. #### Modal dialogs ```tsx
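<div role="dialog" aria-modal="true" aria-labelledby="confirm-title" aria-describedby="confirm-desc">
  {/* ids are illustrative; they only need to match the aria-* references */}
  <h2 id="confirm-title">{t("confirm.title")}</h2>
  <p id="confirm-desc">{t("confirm.description")}</p>
  {/* focus trap; Esc closes; focus restores */}
</div>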
``` Native `<dialog>` is increasingly viable; headless libraries (Radix Dialog) wrap the pattern with full a11y. #### Loading states ```tsx // ✓ right
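<div aria-busy={isLoading} aria-live="polite">
  {/* Skeleton and Content are placeholder component names */}
  {isLoading ? <Skeleton /> : <Content />}
</div>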
``` Loading is announced; once loaded, polite update doesn't interrupt. #### Skip-to-content link ```tsx
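// "skip-link" is an illustrative class name: visually hidden until focused
<a href="#main" className="skip-link">{t("a11y.skip-to-main")}</a>
<main id="main">...</main>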
``` Hidden until focused (first Tab); jumps screen-reader past navigation. ### Testing discipline Per UI-touching PR: 1. **Axe scan** (CI-automatic): no critical / serious violations. 2. **Keyboard pass**: Tab through the changed screen; verify reachability + focus + activation. 3. **Screen-reader pass**: spot-check the changed screen with one screen reader. Quarterly: - **Full screen-reader pass**: all primary user journeys. - **Mobile screen-reader pass**: TalkBack (Android) or VoiceOver iOS. - **User testing with disability community**: yields findings automation cannot. ### Common ARIA misuse | Mistake | Why wrong | Fix | |---|---|---| | `role="button"` on a `\` with no keyboard handler | Reachable; not operable | Use `\` primitive | | `aria-label` duplicating visible text | Redundant; sometimes contradictory | Either visible label OR aria-label, not both | | `aria-hidden="true"` on a focusable element | Hidden semantically; reachable by Tab | Use `inert` instead, or remove from tab order | | `role` invalidating native semantics | ` } secondaryAction={ // optional secondary {t("users.empty.cold.learn")} } /> ``` The primitive is part of the shared catalog ([`primitives-pattern.md`](./primitives-pattern.md)). It enshrines layout, typography, spacing — never reinvent per screen. ### Three useful variants 1. **Cold-start empty** — onboarding moment; CTA is "create your first". 2. **Filtered empty** — user has data, filter excludes it; CTA is "clear filter". 3. **Error empty** — request failed; CTA is "retry". The component prop hints which: ``. Variant changes the default icon + tone. ### Honesty about cause Anti-pattern: a single "No results" that hides whether the user has zero data or just filtered it all out. The user clicks "create" thinking they need to create one — and discovers later they already had 50, hidden by a filter. Honest pattern: detect the cause, choose the matching empty state. 
```ts function EmptyResolver({ rows, filter, error, hasPermission }) { if (!hasPermission) return ; if (error) return ; if (filter && hasUnderlyingData) return ; return ; } ``` Knowing whether underlying data exists may require a second query (a cheap `count(*)` that ignores the filter). That cost is well spent — the empty state's honesty depends on it. ### Loading vs empty Loading and empty are different states: - **Loading**: show a skeleton (per [`universal.md`](./universal.md) Rule 4). - **Loaded + zero rows**: show empty state. Anti-pattern: empty state flashes during loading (because `rows.length === 0` is briefly true before the fetch resolves). Avoid: distinguish "haven't fetched yet" from "fetched and got nothing". Only show empty when the fetch has resolved with zero rows. ### Empty state is also a teaching moment Cold-start empty is often the first time a user encounters a feature. It is a *zero-cost onboarding surface*: - Explain the feature in one sentence ("Teammates can collaborate on flows in this workspace"). - Tell them the next step ("Invite by email"). - Optional: link to docs / video for deeper context. Filtered-empty and error-empty are not teaching moments — they are recovery moments. Keep them brief. ### Visuals - **Icon / illustration**: optional. Reinforces semantic. Token-driven (no hardcoded color). - **Tone**: matches the cause. Cold = warm invite. Filtered = neutral. Error = serious. - **Size**: takes the same content area as the data would. Avoid pushing the empty state into a corner. ### The gate A lint / completeness check flags patterns like: ```tsx {rows.length === 0 ?
<div>No results</div>
: <Table rows={rows} /> /* Table is a placeholder name */}
{rows.length === 0 &&
<div>
<p>No data</p>
</div>
} ``` These bypass the empty-state primitive. Failure message points to the `\` import. ### Per-screen empty inventory For each screen with collection surfaces, the completeness contract ([`universal.md`](./universal.md) Rule 9) requires an empty-state pass: - Cold-start cause: ✓ covered with `\`. - Filtered cause (if filters exist): ✓ covered separately. - Error cause: ✓ covered with retry. PR template includes this checklist for any UI-touching PR. ### Common failure modes - **One empty state for all causes.** User cannot tell why empty. → Branch by cause. - **Empty state in a 200×40 px slice.** Looks like a layout bug. → Match the data area. - **Empty state without a CTA.** Dead end. → Always one primary next-step. - **CTA that opens a complex flow.** Friction kills cold-start. → CTA is the simplest valid next action. - **Empty state appears for one frame during loading.** Jittery. → Wait until the fetch resolves with zero. ### See also - [`universal.md`](./universal.md) — Rule 4 (loading), Rule 5 (empty), Rule 9 (completeness). - [`primitives-pattern.md`](./primitives-pattern.md) — `\` is a shared primitive. - [`intl-pattern.md`](./intl-pattern.md) — empty-state copy is intl-keyed. ==== https://playbook.agentskit.io/docs/pillars/ui-ux/i18n-deep-pattern --- title: 'Internationalisation Deep Pattern' description: 'Beyond ''wrap every string in `t()`'' — the substance of locale-correct UI: plural rules, gender, ICU formatting, RTL, dates, numbers, currency, sorting, search.' --- # Internationalisation Deep Pattern Beyond "wrap every string in `t()`" — the substance of locale-correct UI: plural rules, gender, ICU formatting, RTL, dates, numbers, currency, sorting, search. ## TL;DR (human) Real intl is harder than key extraction. Each language has plural rules; some have gender; numbers / dates / currencies format differently by locale; right-to-left languages flip layout. 
Use ICU MessageFormat for messages; `Intl.*` APIs for formatting; CLDR data for everything locale-specific. Test in pseudo-locales + at least one RTL. ## For agents ### Beyond key extraction The [`intl-pattern.md`](./intl-pattern.md) sibling doc covers the basic discipline: every string keyed, `useT()` everywhere. This doc covers what comes after. ### ICU MessageFormat Plain interpolation is insufficient for plurals and gender: ```ts // ✗ wrong — doesn't pluralise; word order baked in t("results", { count }) // "{count} result(s)" // ✓ ICU MessageFormat t("results", { count }) // "results": "{count, plural, =0 {No results} one {# result} other {# results}}" ``` ICU handles: - **plural**: `=0`, `one`, `two`, `few`, `many`, `other` — depends on locale rules (CLDR). - **select**: branching on a value (gender, status). - **selectordinal**: ordinal numbers (1st, 2nd, 3rd). - **number**, **date**, **time**: format with locale rules. Library: `formatjs/intl-messageformat`, `messageformat`, `i18next` with the icu plugin. Plural rules differ wildly: - English: 2 forms (one / other). - Russian: 4 forms (one / few / many / other). - Arabic: 6 forms (zero / one / two / few / many / other). - Japanese, Chinese: 1 form. Hard-coding "one" / "other" breaks Russian. Use CLDR-derived rules. ### Gender Some languages mark gender: ``` "welcome": "{gender, select, female {Bienvenida} male {Bienvenido} other {Bienvenidos}}, {name}" ``` Gendered translations need: - The user's gender (or "prefer not to say" → use neutral form). - A neutral fallback for languages that don't have gendered forms. Avoid generating sentences from glued fragments — gender + plural agreement requires the whole sentence at once. 
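To see why hard-coded "one" / "other" branches fail, the native `Intl.PluralRules` API reports which CLDR category a count falls into for a given locale. The sketch below is a minimal illustration of that lookup; the `pickPlural` helper and the `ruResults` message table are hypothetical stand-ins for what an ICU MessageFormat library does internally, not a replacement for one.

```typescript
type PluralCategory = "zero" | "one" | "two" | "few" | "many" | "other";

// Every message must supply "other"; the remaining categories are optional
// because each locale only uses a subset of them.
type PluralForms = Partial<Record<PluralCategory, string>> & { other: string };

// Hypothetical helper: choose the form for `count` using the locale's CLDR
// plural rules, then substitute "#" with the locale-formatted number.
function pickPlural(locale: string, count: number, forms: PluralForms): string {
  const category: PluralCategory = new Intl.PluralRules(locale).select(count);
  const template = forms[category] ?? forms.other;
  return template.replace("#", new Intl.NumberFormat(locale).format(count));
}

// Russian needs four forms; an English-style one/other pair cannot express this.
const ruResults: PluralForms = {
  one: "# результат",
  few: "# результата",
  many: "# результатов",
  other: "# результата",
};

for (const n of [1, 2, 5, 21]) {
  const category = new Intl.PluralRules("ru").select(n);
  console.log(n, category, pickPlural("ru", n, ruResults));
}
```

Note that 21 falls into the `one` category in Russian, which is exactly the kind of rule a hand-written `count === 1` check gets wrong.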
### Number formatting ```ts // Locale-aware decimal separator, thousands grouping new Intl.NumberFormat("en-US").format(1234567.89); // "1,234,567.89" new Intl.NumberFormat("de-DE").format(1234567.89); // "1.234.567,89" new Intl.NumberFormat("hi-IN").format(1234567.89); // "12,34,567.89" (Indian numbering) new Intl.NumberFormat("ar-EG").format(1234567.89); // "١٬٢٣٤٬٥٦٧٫٨٩" (Arabic-Indic digits) // Percentages new Intl.NumberFormat("en-US", { style: "percent" }).format(0.42); // "42%" // Compact new Intl.NumberFormat("en-US", { notation: "compact" }).format(12345); // "12K" ``` ### Currency ```ts new Intl.NumberFormat("en-US", { style: "currency", currency: "USD" }).format(99.95); // "$99.95" new Intl.NumberFormat("ja-JP", { style: "currency", currency: "JPY" }).format(99.95); // "￥100" (rounded; no decimals; full-width yen sign) new Intl.NumberFormat("de-DE", { style: "currency", currency: "EUR" }).format(99.95); // "99,95 €" ``` The currency code (USD / JPY / EUR) is part of the data, not derived from locale. A user in Germany might view US dollars. ### Date and time ```ts const d = new Date(2024, 9, 14); // 14 Oct 2024, local time (months are 0-based) new Intl.DateTimeFormat("en-US").format(d); // "10/14/2024" new Intl.DateTimeFormat("en-GB").format(d); // "14/10/2024" new Intl.DateTimeFormat("ja-JP").format(d); // "2024/10/14" new Intl.DateTimeFormat("ar-EG").format(d); // Arabic-Indic digits // Relative time new Intl.RelativeTimeFormat("en-US").format(-1, "day"); // "1 day ago" new Intl.RelativeTimeFormat("es-ES").format(-1, "day"); // "hace 1 día" ``` Timezone discipline: - Server stores UTC (ISO-8601 with offset). - Client renders in user's locale + timezone. - For "5 days from now" calculations: use the user's timezone (a date in Tokyo is not the same date in LA). Libraries: native `Intl.*` is usually enough; `date-fns` + `date-fns-tz` or `Luxon` for richer manipulation. ### Right-to-left (RTL) Arabic, Hebrew, Persian, Urdu read right-to-left. CSS: - `dir="rtl"` on `<html>` or per-region. 
- Logical properties: `margin-inline-start` (not `margin-left`), `padding-inline-end` (not `padding-right`). - Icons that imply direction (arrows, chevrons) mirror. - Text alignment: `text-align: start` (not `text-align: left`). Mixed-direction content (English text in Arabic UI): use `\` and `dir="auto"`. Layouts that look fine in LTR can be broken in RTL: - Asymmetric padding. - Custom dropdowns with hardcoded positioning. - Carousels with directional swipe. Test in at least one RTL locale before shipping. ### Locale identifiers (BCP 47) | Format | Meaning | |---|---| | `en` | English (any region) | | `en-US` | English, United States | | `en-GB` | English, United Kingdom | | `pt-BR` | Portuguese, Brazil | | `pt-PT` | Portuguese, Portugal | | `zh-Hant` | Traditional Chinese | | `zh-Hans` | Simplified Chinese | | `ar-EG` | Arabic, Egypt | User locale → fallback chain: `pt-BR` → `pt` → default (`en`). Implement: locale = user setting + browser hint + URL param, with explicit precedence. ### Sort + search Locale-aware string comparison: ```ts "ä".localeCompare("z", "de"); // -1 (ä before z in German) "ä".localeCompare("z", "sv"); // 1 (ä after z in Swedish) ``` `Intl.Collator` for batch sorting. Locale-aware sort matters for: - User-facing lists (sort by name). - Search match scoring. - Autocomplete ranking. ### Pluralisation of intl keys themselves Avoid: ```ts t("invite-button") // "Invite" t("invite-buttons") // "Invites" ``` Two keys, two translations, two slots to drift. Instead: ```ts t("invite", { count }) // ICU plural handles it ``` One key, one translation, plurals correct in every locale. ### Translation workflow Three actors: - **Developer**: adds keys to source locale (typically `en`). - **Translator**: receives keys; produces target locales. - **Translation management** (TMS): platform (Phrase, Crowdin, Lokalise) that syncs keys, manages translator work, returns completed translations. 
CI checks: - Every source-locale key exists in every shipped locale (or has documented fallback). - No orphan keys (in target but not source). - No untranslated keys (in source but not target, beyond fallback policy). ### Pseudo-locale for testing A `qa` / `pseudo` locale transforms strings: ``` Save → [!! Šåvé !!] Loading… → [!! Łõåðîñğ… (~30% longer) !!] Welcome to Acme → [!! Wélçömé tö Áçmé !!] ``` Run the app in pseudo-locale: - Hardcoded strings stand out (not transformed). - Length-sensitive layouts show their breakage. - Missing keys obvious (no `[!! ... !!]` wrap). CI screenshots in pseudo-locale catches drift before release. ### Currency + region pairing A pricing page shows different prices per region. Two concerns: - **Display currency**: format per user locale, regardless of price source. - **Tax / VAT**: per region; show inclusive vs exclusive per regulatory norm. Avoid mixing the user's locale with the *product's* currency (a Japanese user viewing USD pricing — keep USD; don't auto-convert unless you mean to). ### Domain-specific localisation Things that are NOT translated: - Brand product name (per [`whitelabel-pattern.md`](./whitelabel-pattern.md) brand-token allowlist). - Code identifiers, file paths, URLs. - Author / contributor names. - Third-party brand names (Slack, GitHub). Things that ARE translated: - Generic terms ("workspace", "user", "settings"). - Status labels ("Running", "Failed"). - Error messages. - Help text. ### Common failure modes - **Plain interpolation for plurals**. "1 result(s)" — broken in any non-English locale. → ICU MessageFormat. - **Date / number raw**. `formatDate(d)` returns ISO. Users see machine format. → `Intl.*`. - **Locale derived from currency**. User in Brazil viewing USD; UI assumes pt-BR formatting for `$`. → Locale and currency independent. - **Hardcoded `margin-left`**. RTL breaks. → Logical properties. - **String concat for sentences**. `t("hello") + " " + name + "!"` → word order assumption baked in. 
→ ICU. - **No RTL test**. Bidi bugs ship. → At least one RTL in CI snapshots. - **Mixed-language fragments**. `Welcome to {productName}, ${userName}!` — direction ambiguity. → `\` / `dir="auto"`. - **CLDR not bundled**. Locale features missing at runtime. → Include CLDR data for shipped locales (bundle size cost; trade-off). ### Tooling stack (typical) | Concern | Tool | |---|---| | Runtime formatting | Native `Intl.*` (broad browser support) | | Message formatting | formatjs, i18next, lingui, react-intl | | Plural / gender data | CLDR (bundled by libraries above) | | TMS platform | Phrase, Crowdin, Lokalise, Tolgee | | Static extraction | i18next-parser, formatjs CLI, lingui extract | | Coverage / parity | i18next-locize, in-house gate | | Date manipulation | `date-fns` + `date-fns-tz`, Luxon, native `Intl.DateTimeFormat` | | Pseudo-locale | pseudo-loc, in-house | ### Adoption path 1. **Day 0**: `useT()` for all strings; `en` only; parity gate disabled. 2. **Month 1**: add ICU MessageFormat for plurals. 3. **Month 2**: add `Intl.*` for date / number / currency. 4. **Quarter 1**: first non-`en` locale; parity gate; pseudo-locale in CI. 5. **Quarter 2**: TMS workflow with external translators. 6. **Quarter 3**: RTL locale; bidi audit on changed screens. 7. **Mature**: localised search, sort, region-aware features. ### See also - [`intl-pattern.md`](./intl-pattern.md) — the basic discipline. - [`whitelabel-pattern.md`](./whitelabel-pattern.md) — product name as a brand token. - [`accessibility-deep-pattern.md`](./accessibility-deep-pattern.md) — `lang` attribute; reading order; bidi. - [`universal.md`](./universal.md) — Rule 3 (intl every string), Rule 8 (human verbs). ==== https://playbook.agentskit.io/docs/pillars/ui-ux/intl-pattern --- title: 'Intl Pattern' description: 'How to never ship hardcoded user-visible strings, so the next locale is configuration, not a sweep.' 
--- # Intl Pattern How to never ship hardcoded user-visible strings, so the next locale is configuration, not a sweep. ## TL;DR (human) Every visible string is a key in a locale file. Components resolve via `useT()` (or your hook). Keys are namespaced by screen / feature. Interpolation is structured (named placeholders, not positional). Aria, title, placeholder, alt attributes are also intl-resolved. Brand tokens (productName) bypass intl via an allowlist. ## For agents ### Key structure Namespaced, dot-separated: ``` flows.editor.save.label flows.editor.save.aria flows.editor.save.success flows.editor.save.error users.empty.title users.empty.description users.empty.invite.cta ``` Conventions: - First segment: feature / screen (`flows`, `users`, `dashboard`). - Subsequent segments: nested context (`editor.save`, `empty`). - Leaf: purpose (`label`, `aria`, `description`, `cta`, `success`, `error`). This produces stable, greppable keys. Agents searching for "where is this string defined" find one place. ### Locale files One file per locale, by convention: ``` locales/ ├── en.json ├── es.json ├── pt-BR.json └── ... ``` Contents: ```json { "flows.editor.save.label": "Save", "flows.editor.save.aria": "Save the current flow", "flows.editor.save.success": "Flow saved.", "flows.editor.save.error": "Could not save: {reason}" } ``` Flat key namespace; nested objects optional but get verbose for deep keys. A small build step ensures every key exists in every shipped locale (or has a documented fallback to `en`). ### The hook ```ts const t = useT(); // simple: t("flows.editor.save.label") // → "Save" // with interpolation: t("flows.editor.save.error", { reason: err.message }) // → "Could not save: Network timeout" ``` Behavior: - Missing key in current locale → falls back to `en`. - Missing key in `en` → returns the key itself (visible bug; not silent). - Interpolation values are escaped per the framework (HTML-escape in JSX context). 
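The fallback behaviour described above can be sketched without any framework. This is a minimal illustration under stated assumptions: `createT` is a hypothetical factory standing in for whatever backs `useT()`, it handles only simple `{name}` placeholders (no ICU plural or select syntax), and it leaves escaping to the rendering layer.

```typescript
type Messages = Record<string, string>;
type Params = Record<string, string | number>;

// Own-property check so keys like "toString" never resolve to inherited members.
const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical factory: builds a `t` that looks in the current locale,
// then the fallback locale (`en`), then returns the key itself.
function createT(current: Messages, fallback: Messages) {
  return function t(key: string, params: Params = {}): string {
    const template = has(current, key)
      ? current[key]
      : has(fallback, key)
        ? fallback[key]
        : undefined;

    // Missing everywhere: surface the key so the bug is visible, not silent.
    if (template === undefined) return key;

    // Replace named placeholders; unknown names are left untouched.
    return template.replace(/\{(\w+)\}/g, (whole: string, name: string) =>
      has(params, name) ? String(params[name]) : whole,
    );
  };
}

const en: Messages = {
  "flows.editor.save.label": "Save",
  "flows.editor.save.error": "Could not save: {reason}",
};
const es: Messages = { "flows.editor.save.label": "Guardar" };

const t = createT(es, en);
console.log(t("flows.editor.save.label")); // "Guardar"
console.log(t("flows.editor.save.error", { reason: "Network timeout" })); // "Could not save: Network timeout"
console.log(t("flows.editor.missing")); // "flows.editor.missing"
```

Returning the raw key for a missing `en` entry keeps the failure visible in screenshots and in pseudo-locale runs.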
### What gets intl'd User-visible: - JSX text content. - `aria-label`, `aria-description`. - `title`, `placeholder`, `alt`. - Error messages displayed to users (server returns `code` → client resolves to localized message). - Toast / notification strings. - Status labels (per [`universal.md`](./universal.md) Rule 8). Not intl'd: - Brand tokens (product name, company name) — via `whitelabel` runtime, allowlisted. - Code / technical identifiers (URLs, capability names, error codes). - Author / contributor names. - Untranslatable proper nouns (third-party brand names). The allowlist of brand tokens lives in a small file (`i18n/exempt-tokens.json` or equivalent). Lint allows tokens in the allowlist; everything else hits the rule. ### The gate Lint AST rules: 1. **JSX text literal** in `*.tsx` files where the content is non-empty and contains a letter. Fail. 2. **Hardcoded `aria-label` / `aria-description` / `title` / `placeholder` / `alt`** as string literals. Fail. Exemptions: - Empty strings. - Strings matching the brand-token allowlist exactly. - Strings inside `\`, `\`, `\` JSX elements. - Comments. Lint script ships at [`../../scripts/check-intl.example.mjs`](../../scripts/check-intl.example.mjs). ### Interpolation discipline ```ts // ✓ named placeholders t("flows.run.banner", { count: 3, name: "deploy" }) // → "3 flows running: deploy" // ✗ positional t("flows.run.banner", 3, "deploy") ``` Why named: - Order can change per-locale. - Reviewer sees the variable names; can verify they match the key. - Adding a placeholder later does not break old call sites. For pluralization, use the framework's plural rules: ```json "flows.run.banner": "{count, plural, one {# flow} other {# flows}} running" ``` ### Locale parity A CI gate verifies: - Every key in `en.json` exists in every other locale (or is documented as inheriting from `en`). - No locale has keys that do not exist in `en` (orphans). - Placeholder names match across locales for the same key. 
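The three parity checks above can be sketched as a pure function over two flat locale maps. This is an assumption-laden illustration, not the playbook's shipped gate: `checkParity` and `topLevelPlaceholders` are hypothetical names, only top-level `{name}` / `{name, type, …}` arguments are compared, and ICU apostrophe escaping and arguments nested inside plural branches are ignored.

```typescript
type Messages = Record<string, string>;

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Collect argument names that open at brace depth 0, e.g. "count" in
// "{count, plural, one {# flow} other {# flows}}". Branch bodies are skipped.
function topLevelPlaceholders(template: string): string[] {
  const names = new Set<string>();
  let depth = 0;
  for (let i = 0; i < template.length; i++) {
    const ch = template[i];
    if (ch === "{") {
      if (depth === 0) {
        const match = /^\s*(\w+)\s*[,}]/.exec(template.slice(i + 1));
        if (match) names.add(match[1]);
      }
      depth++;
    } else if (ch === "}") {
      depth = Math.max(0, depth - 1);
    }
  }
  return Array.from(names).sort();
}

// Returns human-readable problems; an empty array means the locales are in parity.
function checkParity(source: Messages, target: Messages): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(source)) {
    if (!has(target, key)) {
      problems.push(`missing: ${key}`);
      continue;
    }
    const expected = topLevelPlaceholders(source[key]).join(",");
    const actual = topLevelPlaceholders(target[key]).join(",");
    if (expected !== actual) problems.push(`placeholder mismatch: ${key}`);
  }
  for (const key of Object.keys(target)) {
    if (!has(source, key)) problems.push(`orphan: ${key}`);
  }
  return problems;
}

const enLocale: Messages = {
  "a.label": "Save",
  "a.error": "Could not save: {reason}",
  "a.count": "{count, plural, one {# flow} other {# flows}} running",
};
const esLocale: Messages = {
  "a.label": "Guardar",
  "a.error": "No se pudo guardar: {motivo}",
  "a.extra": "Extra",
};

console.log(checkParity(enLocale, esLocale));
// [ "placeholder mismatch: a.error", "missing: a.count", "orphan: a.extra" ]
```

A gate built this way would need an explicit list of keys documented as inheriting from `en`, which this sketch leaves out.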
This prevents one locale silently lagging. ### Pseudo-locale for testing A `qa` or `pseudo` locale that transforms strings (e.g. `"Save"` → `"[!! Šåvé !!]"`) helps catch: - Hardcoded strings (they don't transform; they stand out). - Length-sensitive layouts (pseudo strings are ~30% longer). - Encoding issues. CI screenshots the app in pseudo locale; reviewer scans for hardcoded English. ### Migration path 1. Stand up the locale file structure + the hook. 2. Define keys for new code. 3. Generate baseline of existing hardcoded strings. 4. Gate to shrink-only. 5. Run a codemod / mass-extract for the easy cases (literal JSX text). 6. Manual sweep for complex cases (concatenations, conditionals). ### Common failure modes - **String concatenation in JSX**: `\Hello {name}!\`. The literal portions are not intl'd. → Use interpolation: `t("greeting", { name })`. - **Conditional fragments**: `{isLoading ? "Loading..." : "Ready"}`. Both literals leaked. → Two keys. - **Concatenating intl results**: `t("a") + " " + t("b")`. Word order assumption baked in. → Single key with interpolation. - **Localizing error codes** in server responses. Client cannot pattern-match. → Server returns stable codes; client maps to localized message. - **Missing brand-token allowlist**. Every product-name use violates intl gate. → Add allowlist; whitelabel runtime resolves `productName`. ### See also - [`universal.md`](./universal.md) — Rule 3. - [`whitelabel-pattern.md`](./whitelabel-pattern.md) — brand-token resolution. - [`a11y-checklist.md`](./a11y-checklist.md) — aria labels are intl'd too. ==== https://playbook.agentskit.io/docs/pillars/ui-ux/primitives-pattern --- title: 'Primitives Pattern' description: 'How to ship a one-package primitives catalog so every screen looks like the same product.' --- # Primitives Pattern How to ship a one-package primitives catalog so every screen looks like the same product. ## TL;DR (human) One UI package owns every interactive primitive. 
Components in screens import from that package. Native HTML elements (`\`, `\`, `\`, etc.) are lint-banned in shipped surfaces. Primitives are styled with design tokens; brand swap reaches them automatically. ## For agents ### The catalog Minimum viable primitives catalog: | Primitive | Replaces native | Variants | |---|---|---| | `Button` | `\`, `\` (action) | primary / secondary / ghost / danger; sm / md / lg | | `IconButton` | `\` with icon-only content | size + variant | | `Link` | `\` (navigation) | primary / muted | | `Input` | ``| with-label / inline / search / password | | `Textarea` | `\` | auto-resize / fixed | | `Select` | `\` | single / multi (using Radix or equivalent) | | `Checkbox` | `` | with-label / indeterminate | | `Radio`, `RadioGroup` | `` | with-label | | `Switch` | `` (toggle role) | | | `Dialog` | `\` | modal / drawer | | `Tooltip` | `title` attr | | | `Tabs` | `role="tablist"` boilerplate | | | `Table` | `\` | sortable / paginated | | `Badge` | `\` with class | status colors | | `Avatar` | `\` | with-initials / with-presence | | `EmptyState` | (none — new primitive) | with-icon / with-illustration | | `Skeleton` | (loading shimmer) | text / block / row | | `Toast` | (system notification) | success / error / info | Add per project: `KPI`, `Card`, `Stat`, `Stepper`, `BreadcrumbBar`, etc. 
### Why a primitive replaces a native element | Concern | Native | Primitive | |---|---|---| | Styling | Browser default; varies | Token-driven; consistent | | A11y attributes | Manually applied per use | Built-in; consistent | | Keyboard handling | Browser default; subtle bugs | Tested + consistent | | Focus ring | Browser default (sometimes invisible) | Token-driven; always visible | | Disabled state | `disabled` only | `disabled` + visual + aria-disabled | | Loading state | Manual hand-rolling | Built-in `loading` prop on `Button` | | Form integration | Native | Compatible with form library | A primitive enshrines the right pattern once; every consumer benefits. ### Built on something Build on top of a headless library (Radix Primitives, React Aria, Headless UI) for a11y semantics. Wrap with your tokens + variants. Do not roll keyboard handling and focus management yourself; the headless libraries have spent thousands of hours on edge cases. Your job: - Provide the visual layer (tokens, variants). - Provide consistent API (prop names, callback signatures) across primitives. - Provide your project's idioms (loading prop, intl prop). ### API consistency All primitives share API conventions: ```ts interface BasePrimitiveProps { // identification id?: string; className?: string; // composable; never wholesale-replaces internal styles // a11y "aria-label"?: string; "aria-labelledby"?: string; "aria-describedby"?: string; // state disabled?: boolean; loading?: boolean; // where applicable } ``` Variants are declared with a small variant utility (e.g. `cva` from `class-variance-authority`) so the same `variant` / `size` prop semantics apply everywhere. 
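To make the idea of a shared variant utility concrete, here is a hand-rolled stand-in. It is deliberately not the `cva` API: `createVariants`, the config shape, and the class names are all hypothetical, and the typing is looser than what a real variant library provides.

```typescript
// Each variant axis (e.g. "variant", "size") maps option names to class strings.
type VariantMap = Record<string, Record<string, string>>;

interface VariantConfig {
  base: string; // classes applied to every instance
  variants: VariantMap;
  defaults: Record<string, string>; // option used when a prop is omitted
}

// Hypothetical helper: returns a function that resolves props to a class string.
function createVariants(config: VariantConfig) {
  return (selected: Record<string, string | undefined> = {}): string => {
    const classes = [config.base];
    for (const [axis, options] of Object.entries(config.variants)) {
      const choice = selected[axis] ?? config.defaults[axis];
      const cls = options[choice];
      // Fail loudly on an unknown option instead of emitting "undefined".
      if (cls === undefined) {
        throw new Error(`Unknown ${axis} option: ${String(choice)}`);
      }
      classes.push(cls);
    }
    return classes.join(" ");
  };
}

// Illustrative Button configuration; class names are placeholders.
const buttonClasses = createVariants({
  base: "btn",
  variants: {
    variant: { primary: "btn-primary", secondary: "btn-secondary", ghost: "btn-ghost", danger: "btn-danger" },
    size: { sm: "btn-sm", md: "btn-md", lg: "btn-lg" },
  },
  defaults: { variant: "primary", size: "md" },
});

console.log(buttonClasses()); // "btn btn-primary btn-md"
console.log(buttonClasses({ variant: "danger", size: "lg" })); // "btn btn-danger btn-lg"
```

Because every primitive would go through the same resolver, `variant` and `size` keep one meaning across the catalog, which is the point of the convention above.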
### File / package shape ``` packages/ui/ ├── src/ │ ├── button/ │ │ ├── button.tsx // ≤ 200 lines │ │ ├── button.stories.tsx // visual catalog │ │ ├── button.test.tsx │ │ └── index.ts │ ├── input/ │ ├── select/ │ ├── dialog/ │ ├── tokens.css // token definitions (or imported from a sibling pkg) │ └── index.ts // re-export all primitives └── package.json ``` Per-primitive subdir lets you split the implementation as it grows; the index file is one barrel for consumers. ### Stories / visual catalog Every primitive has a stories file. Two purposes: 1. **Visual regression**: snapshots run in CI; token drift surfaces. 2. **Documentation**: agents reading the primitive's API see all variants in one place. Stories use Ladle / Storybook / your project's tool. Same toolchain as the rest of the repo. ### The gate Lint rule: ```js // no native html in shipped surfaces "no-restricted-syntax": [ "error", { selector: "JSXOpeningElement[name.name=/^(button|input|select|textarea|dialog|form|table|a)$/]", message: "Use the shared primitive from @your/ui instead of native HTML.", }, ], ``` Per file overrides for framework-mandated places (Next.js layouts, MDX content, raw HTML editors). Escape hatch: `// allow-native: \`. Counted by a gate that fails on growth. ### Migration path Brownfield: large existing app, lots of native elements. 1. Ship primitives. 2. Generate baseline of native-html offenders. 3. Gate to shrink-only. 4. Codemod where possible (` // ✓ right ``` Brand tokens (product name, company name) are exempt — they live in a small allowlist and resolve from whitelabel runtime. Gate: lint AST scan for JSX text nodes with non-empty string literals + hardcoded `aria-*` / `title` / `placeholder` / `alt` attributes. Exempt: comments, `\` / `\` content, allowlisted brand tokens. **Failure mode prevented:** half-translated UI; aria attributes only in English; brand rename requires touching every screen. 
### Rule 4 — Skeletons for content loading; spinners only for inline actions When a content surface is loading data: - Show a **skeleton** that matches the final layout shape. - Do not show a spinner that replaces content. Spinners are reserved for inline actions (a button while submitting; a row while saving). Why: skeletons preserve layout (no jank when content arrives); communicate roughly what is coming (sets expectation); avoid the "what is this loading forever?" panic state. Gate: lint regex for `\` (or equivalent) inside content-bearing layout containers. Combined with manual review for "this surface shows a spinner instead of a skeleton". **Failure mode prevented:** layout jank on every page load; users panic at indeterminate spinners. ### Rule 5 — Empty states always-on, always tell next step Every list / collection surface has an empty state component. The empty state: - States what would be here (one sentence). - States the next step (a CTA, a link, an instruction). - Is **honest** about why empty (no data yet vs filtered out vs permission-denied — each is a different empty state). ```tsx // ✗ wrong {rows.length === 0 ?
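  <p>No results</p>
  : <Table rows={rows} />
}

// ✓ right (the `action` and `rows` prop names are illustrative)
{rows.length === 0
  ? <EmptyState action={<Button>{t("users.empty.invite")}</Button>} />
  : <Table rows={rows} />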
}
```

Different empty causes → different empty states. "No results match filter" is different from "No users in this workspace yet".

Gate: a completeness check that flags `length === 0 ?` ternaries rendering anything other than the `EmptyState` primitive.

**Failure mode prevented:** users land on an empty page with no idea what to do; honest empty cause is hidden behind generic "no results".

### Rule 6 — Motion respects `prefers-reduced-motion`; durations from tokens

Every animation / transition:

- Reads `prefers-reduced-motion` from the user OS; honors it.
- Uses duration tokens (`--duration-fast`, `--duration-normal`, `--duration-slow`).
- Does not exceed ~300ms for UX feedback; longer durations are deliberate and rare.

Two principles:

- **Motion is communication, not decoration.** Movement clarifies state change. Decorative motion fatigues.
- **Reduced motion is not "no motion".** Opacity changes, color changes, and instant transitions remain. Translates and rotations stop.

Gate: lint scan for `transition-duration: \d` (hardcoded ms), `transform: translate` in CSS without a `@media (prefers-reduced-motion: reduce)` partner rule (heuristic; opt-in).

**Failure mode prevented:** vestibular-disorder users experience nausea; brand motion drift across screens.

### Rule 7 — Keyboard-first; screen-reader pass per changed screen

Every interactive element:

- Is reachable by Tab (logical order).
- Has a visible focus indicator (not the default browser outline, but a token-based replacement).
- Triggers its primary action from the keyboard per HTML semantics: Enter or Space for buttons; Enter only for links.
- Has an accessible name (label, aria-label, aria-labelledby).

For changed screens, run a screen-reader pass before merging:

- VoiceOver (macOS) / NVDA (Windows) / TalkBack (Android).
- Read top-to-bottom, then by region.
- Confirm: page title makes sense; landmarks navigate cleanly; live regions announce.
Gate: axe / `@axe-core` automated scan on every changed screen in CI. Screen-reader pass is a manual checklist linked from the PR template for ui-touching PRs. **Failure mode prevented:** screens that are unusable without a mouse; screen readers reading raw class names or "button button button" with no context. ### Rule 8 — Status language in human verbs, never enum codes User-visible status: - ✓ "Saved", "Running", "Awaiting approval", "Failed to send", "Reconnecting…" - ✗ "succeeded", "PENDING", "AWAITING_HUMAN_INPUT", "error_state_2" Internal status enums map to user-visible labels via an i18n table. Multiple internal states can collapse to one user-visible label when the user does not need the distinction. Gate: lint scan for known internal enum values in JSX text nodes ("succeeded", "pending", "upserted", etc.). Project maintains the project-specific banned list. **Failure mode prevented:** users see "AWAITING_HUMAN_INPUT" and ask support what it means; brand voice broken by leaked enum strings. ### Rule 9 — Per-screen completeness contract — no disabled tabs, no `not implemented` Every screen that ships passes the completeness contract: - No `TODO` / `FIXME` markers in the shipped tree. - No `disabled: true` tabs in nav. - No `throw new Error('not implemented')` in shipped surfaces. - No empty exported component bodies. - Every tab in the screen renders meaningful content. If a feature is not built, it does not ship in nav. Behind a feature flag, fine. Built but disabled, never. Gate: a `check-completeness` script scans the shipped tree. See [`../../scripts/README.md`](../../scripts/README.md). Per-screen completeness contracts live in `docs/completion/\.md` (or equivalent): what "done" means for that screen, what tabs / sub-views are in scope. **Failure mode prevented:** screens that look done but break on click; demo gets to a tab that crashes; investor sees `[NOT IMPL]` in production. 
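A rough sketch of what a completeness scan for Rule 9 might look like, assuming file contents have already been read into memory. The function name, the rule names, and the regular expressions are assumptions for illustration; the playbook's actual `check-completeness` script may work differently, and the `disabled: true` pattern is a heuristic that would need scoping to nav definitions.

```typescript
type Violation = { file: string; line: number; rule: string };

// Heuristic patterns for three of the contract items above (illustrative only).
const COMPLETENESS_RULES: Array<{ rule: string; pattern: RegExp }> = [
  { rule: "todo-marker", pattern: /\b(TODO|FIXME)\b/ },
  { rule: "not-implemented", pattern: /throw new Error\(\s*['"`]not implemented['"`]\s*\)/i },
  { rule: "disabled-tab", pattern: /\bdisabled:\s*true\b/ },
];

// Scan a map of path → source text and report every matching line.
function findCompletenessViolations(files: Record<string, string>): Violation[] {
  const violations: Violation[] = [];
  for (const [file, source] of Object.entries(files)) {
    source.split("\n").forEach((text, index) => {
      for (const { rule, pattern } of COMPLETENESS_RULES) {
        if (pattern.test(text)) {
          violations.push({ file, line: index + 1, rule });
        }
      }
    });
  }
  return violations;
}

const shippedTree: Record<string, string> = {
  "nav.ts": "const tabs = [\n  { id: 'logs', disabled: true },\n];",
  "run.ts": "// TODO wire this\nexport function run() { throw new Error('not implemented'); }",
  "ok.ts": "export const ready = true;",
};

for (const v of findCompletenessViolations(shippedTree)) {
  console.log(`${v.file}:${v.line} ${v.rule}`);
}
// nav.ts:2 disabled-tab
// run.ts:1 todo-marker
// run.ts:2 not-implemented
```

In CI, a non-empty result would fail the build; empty exported component bodies need an AST pass and are out of scope for a regex sketch like this.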
### Rule 10 — Whitelabel-ready — product name + logo + palette swappable Even if you do not currently sell whitelabel: build as if you might. Costs little; saves enormously when the request comes. What that means: - **Product name** comes from a `productName` token, not hardcoded in JSX. Default to a fallback (e.g. `"App"`) so omitting the token does not produce empty strings. - **Logo / favicon** swappable via a brand-kit JSON. - **Palette / typography** swappable via token overrides. - **Plan presets** (which features are enabled for which tier) configurable. - **Build / runtime distinction**: brand assets baked at build for performance; runtime overrides for preview / OEM admin. Gate: a small "whitelabel readiness" check ensures `productName` token usage in user-visible strings; no hardcoded brand strings in shipped code outside the allowlist. **Failure mode prevented:** "we got an OEM customer; can you reskin?" answered with a six-month project; brand strings leak via PRs that bypass the intl gate. ## See also - [`design-tokens-pattern.md`](./design-tokens-pattern.md), [`primitives-pattern.md`](./primitives-pattern.md), [`intl-pattern.md`](./intl-pattern.md), [`empty-states-pattern.md`](./empty-states-pattern.md), [`a11y-checklist.md`](./a11y-checklist.md), [`whitelabel-pattern.md`](./whitelabel-pattern.md) - [`../architecture/file-size-budget.md`](../architecture/file-size-budget.md) — `.tsx` budget forces sub-component extraction. - [`../quality/quality-gates-pattern.md`](../quality/quality-gates-pattern.md) — token / native-html / intl / completeness gates. ==== https://playbook.agentskit.io/docs/pillars/ui-ux/whitelabel-pattern --- title: 'Whitelabel Pattern' description: 'How to make every product surface reskinnable per tenant — even if you do not sell whitelabel today.' --- # Whitelabel Pattern How to make every product surface reskinnable per tenant — even if you do not sell whitelabel today. 
## TL;DR (human) A whitelabel runtime resolves product name, logos, palette, typography, motion, and plan presets per tenant. Components reference token names and `productName`; values come from the runtime. Build-time bake for performance; runtime overrides for preview / OEM admin. ## For agents ### What "whitelabel" includes | Surface | Whitelabel'd via | |---|---| | Product name | `productName` token (with safe fallback) | | Logo + favicon | Brand kit asset paths | | Palette | Token overrides (primitive + semantic layers) | | Typography | Font family + weight overrides | | Motion | Duration + easing overrides (rare) | | Plan presets | Feature flags resolved per tenant tier | | Legal links | ToS, privacy, contact resolved per tenant | | Email templates | Sender name + footer per tenant | What is **not** whitelabel'd (always the same): - Behavior. The product does the same things regardless of brand. - Names of features in UI (those are intl, brand-token-free). - Stable API contracts (consumers depend on these). ### Brand kit shape A brand kit is a JSON document. Schema: ```ts type BrandKit = { productName: string; // "AppName" productNameFallback: string; // when productName missing at render time legalEntity?: string; // company name in footers palette: { accent: string; // oklch / hex surface1: string; surface2: string; textPrimary: string; textOnAccent: string; danger: string; success: string; // ... per project }; typography: { fontSans: string; // "Inter, system-ui" fontMono: string; fontDisplay?: string; }; logos: { primary: string; // path / data url favicon: string; emailHeader?: string; }; motion?: { durationFast: string; durationNormal: string; durationSlow: string; }; legalLinks?: { terms?: string; privacy?: string; contact?: string; }; planPresets?: Record; }; ``` A default brand kit ships with the repo. Per-tenant overrides apply via the runtime. 
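How per-tenant overrides might layer onto the default kit can be sketched as a shallow-per-section merge. This is an illustration under assumptions: `MiniBrandKit` is a reduced, hypothetical subset of the schema above, `resolveBrandKit` is not a function the playbook defines, and override values that are explicitly `undefined` are not filtered out here.

```typescript
type Palette = { accent: string; surface1: string; textPrimary: string };
type Logos = { primary: string; favicon: string };

// Reduced subset of the BrandKit schema, for illustration only.
type MiniBrandKit = {
  productName: string;
  productNameFallback: string;
  palette: Palette;
  logos: Logos;
};

// Tenants override only what differs from the default kit.
type BrandKitOverride = {
  productName?: string;
  palette?: Partial<Palette>;
  logos?: Partial<Logos>;
};

// Hypothetical resolver: defaults first, tenant values on top, and a
// non-empty product name guaranteed via the fallback.
function resolveBrandKit(base: MiniBrandKit, override: BrandKitOverride = {}): MiniBrandKit {
  const productName =
    override.productName?.trim() || base.productName.trim() || base.productNameFallback;
  return {
    productName,
    productNameFallback: base.productNameFallback,
    palette: { ...base.palette, ...override.palette },
    logos: { ...base.logos, ...override.logos },
  };
}

const defaultKit: MiniBrandKit = {
  productName: "Acme",
  productNameFallback: "App",
  palette: { accent: "#0055ff", surface1: "#ffffff", textPrimary: "#111111" },
  logos: { primary: "/brand/logo.svg", favicon: "/brand/favicon.ico" },
};

const tenantKit = resolveBrandKit(defaultKit, {
  productName: "   ", // blank override is ignored
  palette: { accent: "#ff8800" },
});

console.log(tenantKit.productName); // "Acme"
console.log(tenantKit.palette); // accent overridden, other palette entries inherited
```

Merging section by section means a tenant that changes one colour still inherits the rest of the palette instead of dropping it.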
### Runtime ```ts // at app boot const kit = await loadBrandKit({ tenantId, env }); applyBrandKit(kit); // applyBrandKit: // 1. Writes CSS variables to :root. // 2. Updates the React context with productName + logos + legalLinks. // 3. Sets the favicon link tag. // 4. Updates document.title prefix if configured. ``` The context provides hooks: ```ts const { productName, logos, legalLinks } = useBrand(); const t = useT(); return (
  <h1>
    {t("dashboard.title", { product: productName })}
  </h1>
); ``` ### Build-time vs runtime Two modes, often combined: | Mode | Behavior | Use when | |---|---|---| | Build-time | Brand kit baked into the bundle at build | Single-brand deploy; max performance | | Runtime | Brand kit fetched per session at boot | Multi-tenant cloud; per-tenant overrides | Implementation pattern: - Build-time defaults bake in. - Runtime override applies on top after boot. - Hot-swap (preview "what would my new brand look like") via re-running `applyBrandKit`. ### Plan presets Per-tenant feature gating goes through the brand kit's `planPresets`: ```ts planPresets: { free: { maxWorkspaces: 1, customDomain: false, ssoEnabled: false }, pro: { maxWorkspaces: 5, customDomain: false, ssoEnabled: false }, team: { maxWorkspaces: 25, customDomain: true, ssoEnabled: true }, } ``` The runtime resolves the tenant's plan, exposes feature flags via a hook: ```ts const { canUseCustomDomain, ssoEnabled } = usePlan(); ``` This keeps plan logic out of business code. ### Product name in strings The single trickiest case: a user-visible string mentions the product name. ```ts // ✗ wrong — hardcoded t("welcome.banner") // → "Welcome to AppName" // ✓ right — interpolated t("welcome.banner", { product: productName }) // → "Welcome to {product}" ``` Locale files contain `"welcome.banner": "Welcome to {product}"`. Intl + whitelabel compose; product name swaps without re-translating. Fallback discipline: if `productName` is missing (e.g. brand kit failed to load), fall back to a safe short string like `"App"`. Empty strings produce broken-looking copy ("Welcome to "). ### The gate A "whitelabel readiness" check ensures: 1. **No hardcoded product name** outside the allowlist. Grep for the production product name in source; fail if found. 2. **No hardcoded brand colors** outside the design-token system. Covered by [`design-tokens-pattern.md`](./design-tokens-pattern.md). 3. **Default brand kit loads** in CI; a test brand kit also loads, both render the app, no errors. 
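The first check in the list above (no hardcoded product name outside the allowlist) could look roughly like this. The function name, the in-memory file map, and the path-based allowlist are assumptions made for the sketch; the real gate may operate on tokens or AST nodes instead.

```typescript
// Hypothetical scan: return the paths whose contents mention the production
// product name, skipping files that are explicitly allowlisted.
function findHardcodedBrand(
  files: Record<string, string>,
  productName: string,
  allowlistedPaths: string[] = [],
): string[] {
  const needle = productName.trim().toLowerCase();
  if (needle.length === 0) return []; // nothing meaningful to search for

  return Object.entries(files)
    .filter(([path]) => !allowlistedPaths.includes(path))
    .filter(([, source]) => source.toLowerCase().includes(needle))
    .map(([path]) => path)
    .sort();
}

const sourceTree: Record<string, string> = {
  "brand/default.json": '{ "productName": "Acme" }',
  "screens/welcome.tsx": 'const title = "Welcome to Acme";',
  "screens/home.tsx": 't("welcome.banner", { product: productName })',
};

console.log(findHardcodedBrand(sourceTree, "Acme", ["brand/default.json"]));
// [ "screens/welcome.tsx" ]
```

A case-insensitive substring match is crude and will flag comments too; that strictness is arguably acceptable for a gate whose job is to keep the brand name in exactly one place.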
### Default brand kit + test brand kit Two brand kits shipped with the repo: - **`default.json`**: the product's primary brand. - **`test.json`**: a wildly different brand (orange instead of blue, different typography, different name). CI renders against this and snapshots key surfaces. Drift between the two indicates a hardcode. ### Common failure modes - **`productName` hardcoded in JSX literal.** Intl gate catches; but easy to miss in attributes (e.g. ``). → Whitelabel readiness gate scans for the brand name in source files. - **Image / logo files referenced by hardcoded path.** Cannot swap. → Logos go through the brand kit's `logos.primary`. - **Plan logic inline in business code** (`if (workspace.tier === "free") ...`). Hard to whitelabel pricing tiers. → Plan logic in `planPresets`; consumer asks the runtime. - **Brand kit applied client-side only.** Server-rendered HTML has the wrong brand for a flash. → Server resolves the kit before render. - **No fallback for missing brand kit.** Page crashes if fetch fails. → Default brand kit always available as fallback. - **OEM admin can edit anything.** Including critical strings that should not move. → OEM admin edits brand kit fields only; never raw locale files or business code. ### Adoption path even if you do not sell whitelabel Cost of building whitelabel-ready from day one: small. Cost of retrofitting later: large. Even single-brand projects benefit: - Cleaner separation of brand assets and code. - Easier dark-mode / theme variants (themes are mini brand kits). - Easier reskin if the company rebrands. - Easier acquisition by a buyer (whitelabel = customer-ready). ### See also - [`universal.md`](./universal.md) — Rule 10. - [`design-tokens-pattern.md`](./design-tokens-pattern.md) — palette flows through tokens. - [`intl-pattern.md`](./intl-pattern.md) — product name composes with intl. 
==== https://playbook.agentskit.io/docs/prompts
---
title: 'Reusable prompts'
description: 'System prompts, sub-agent recipes, and slash-command bodies that consistently produce gold-standard output.'
---

# Reusable prompts

System prompts, sub-agent recipes, and slash-command bodies that consistently produce gold-standard output.

## Status

✓ v1 — 14 prompt bodies shipped (listed in the index below). Adapt the bodies to your toolchain (Claude Code, Cursor, Aider, your CLI).

## Index

| Prompt | Type | Use when |
|---|---|---|
| [`system-architect.md`](./system-architect.md) | system | Designing a new package boundary, ADR, or contract |
| [`system-implementer.md`](./system-implementer.md) | system | Building a sub-unit against an existing design |
| [`system-reviewer.md`](./system-reviewer.md) | system | Code review pass with confidence-scored output |
| [`system-security.md`](./system-security.md) | system | Security review of pending changes |
| [`subagent-explore.md`](./subagent-explore.md) | sub-agent recipe | Read-only fan-out search across files |
| [`subagent-plan.md`](./subagent-plan.md) | sub-agent recipe | Step-by-step implementation plan |
| [`subagent-code-explorer.md`](./subagent-code-explorer.md) | sub-agent recipe | Trace execution paths, map dependencies |
| [`subagent-code-reviewer.md`](./subagent-code-reviewer.md) | sub-agent recipe | Confidence-filtered review |
| [`slash-goal.md`](./slash-goal.md) | slash command | Set a session goal + stop hook |
| [`slash-loop.md`](./slash-loop.md) | slash command | Schedule recurring or self-paced runs |
| [`slash-review.md`](./slash-review.md) | slash command | Multi-agent PR review |
| [`slash-clear.md`](./slash-clear.md) | slash command | Reset session context cleanly |
| [`slash-sanity.md`](./slash-sanity.md) | slash command | Run cross-cutting sanity audit |
| [`slash-ship.md`](./slash-ship.md) | slash command | Run release-gate checklist |

## Sub-agent strategy

When orchestrating long fan-outs, delegate to scoped specialists:

| Task | Sub-agent type | Model tier |
|---|---|---|
| File / symbol lookup | `explore` | haiku |
| Documentation, unit tests, code review | `plan` / `code-reviewer` | sonnet |
| Complex implementation needing deep reasoning | `implementer` | opus |

Tier by task complexity. Reserve opus for what truly needs it; haiku for trivial fan-outs.

## Slash-command discipline

- One command = one well-scoped workflow.
- The command body is a prompt template, not a script.
- Commands with side effects (open PR, push, merge) require explicit user confirmation in their body.

## See also

- [`../pillars/ai-collaboration/README.md`](../pillars/ai-collaboration/README.md)
- [`../templates/CLAUDE.md.template.md`](../templates/CLAUDE.md.template.md) — bootstrap doc references these.

==== https://playbook.agentskit.io/docs/prompts/slash-clear
---
title: 'Slash Command — /clear'
description: 'Reset session context cleanly without losing persistent memory.'
---

# Slash Command — /clear

Reset session context cleanly without losing persistent memory.

## Trigger

```
/clear
```

## Body

```
Reset session context. Persistent memory is preserved; ephemeral chat history is dropped.

What happens:
1. The agent's working context (current task, recent file reads, conversation history) is cleared.
2. The next message starts fresh.
3. Bootstrap docs (CLAUDE.md, AGENTS.md) reload on next read.
4. MEMORY.md index reloads on next session start.

What does NOT happen:
- Persistent memory files (`.agent-memory/*.md`) are NOT deleted.
- Repo state is NOT touched (no `git reset`, no `git checkout`).
- Open files in the editor are NOT closed.

When to use:
- The agent's context drifted (started a different task; reasoning about stale state).
- Context window is full and the current task is unrelated to past chat.
- Starting a new session after a long break.

When NOT to use:
- Mid-implementation of a sub-unit — losing the implementation context is wasteful.
- To "fix" a problem by clearing context — the problem will recur. Fix the root cause.

Compared to other resets:
- `/clear` — chat only.
- `git reset --hard origin/main` — discards uncommitted work; explicit destructive action.
- New session entirely (close + reopen) — equivalent to /clear in most toolchains.

Rules:
- /clear is reversible only if the toolchain preserves transcripts. Treat as one-way.
- After /clear, re-read CLAUDE.md and the issue/task description before resuming.
```

## See also

- [`../pillars/ai-collaboration/memory-pattern.md`](../pillars/ai-collaboration/memory-pattern.md) — memory survives.
- [`../pillars/ai-collaboration/bootstrap-doc-pattern.md`](../pillars/ai-collaboration/bootstrap-doc-pattern.md) — bootstrap reload on next session.

==== https://playbook.agentskit.io/docs/prompts/slash-goal
---
title: 'Slash Command — /goal'
description: 'Set a session goal with an explicit exit condition. Stops only when condition holds.'
---

# Slash Command — /goal

Set a session goal with an explicit exit condition. Stops only when condition holds.

## Trigger

```
/goal
```

## Args

- `\` — the success state. Imperative phrasing. Examples:
  - `tests green for package X on a fresh clone`
  - `PR open with intent manifest and gates passing`
  - `ADR-NNNN drafted, reviewed, and accepted`

## Body

```
A session goal has been set. The session does not end until the condition holds:

Rules:

1. WORK TOWARD THE CONDITION
   - Every action contributes to the goal or is justified as a prerequisite.
   - Do not stop because the turn "feels complete" — only when the condition is verifiably met.

2. ASSESS PROGRESS HONESTLY
   - At the end of each work block, state where you are vs the condition.
   - "Halfway" or "blocked on X" is acceptable; "done" without verification is not.

3. VERIFY THE EXIT
   - The condition is met when:
     - Gates / tests / `gh issue view` (whatever applies) confirms it.
     - The verification is reproducible — you describe how to re-check it.
   - State the verification in the final message.

4. UNBLOCK
   - If you cannot make progress: state why explicitly.
   - Pick the smallest unblock action (file an issue, ask a question, run a diagnostic).
   - Do not spin.

5. SCOPE GUARD
   - Do not expand the goal. New work discovered along the way → file an issue, do not pursue.
   - If the original goal is wrong (impossible / poorly defined), STOP and ask for an updated goal.

6. ON EXIT
   - State the goal again.
   - State the verification (how you know it holds).
   - Summarize the work that got there (bullet list).

Hard rules:
- No optimistic reporting. If gates failed, gates failed.
- No "should be done" — only "verified done" or "blocked on X".
```

## Combining with other commands

`/goal` typically sits at the session opening. Inside it, the agent may delegate to sub-agents (`subagent-explore`, `subagent-plan`, `subagent-code-reviewer`) or invoke other slash commands (`/sanity`, `/review`).

## Common failure modes

- **Ambiguous condition.** "Make it better." — not verifiable. → Re-state until verifiable.
- **Agent declares done without verification.** → Body's Rule 3 mandates a verification step.
- **Goal scope creep.** → Body's Rule 5 mandates new work goes to an issue.
- **Infinite loop on a blocked task.** → Body's Rule 4 mandates explicit unblock or stop.

## See also

- [`../pillars/ai-collaboration/universal.md`](../pillars/ai-collaboration/universal.md) — Rule 7 (explicit goal + exit).
- [`slash-loop.md`](./slash-loop.md) — for self-paced or recurring runs.

==== https://playbook.agentskit.io/docs/prompts/slash-loop
---
title: 'Slash Command — /loop'
description: 'Run a task repeatedly. Either on a fixed interval or self-paced based on an exit condition.'
---

# Slash Command — /loop

Run a task repeatedly. Either on a fixed interval or self-paced based on an exit condition.

## Trigger

```
/loop []
```

- `/loop 5m /sanity` — every 5 minutes, run `/sanity`.
- `/loop /watch-pr 1234` — self-paced; wake when the PR's state changes meaningfully.
- `/loop` — autonomous; agent picks the next task each tick.

## Args

- `\` (optional) — fixed interval (`30s`, `5m`, `1h`). Omit for self-paced.
- `\` — the prompt body or slash command to execute each tick.

## Body

```
You are in a loop. Each tick: execute the task, then schedule the next.

Rules:

1. EXECUTE
   - Run the task body verbatim.
   - Report status at the end of each tick (succeeded / blocked / found nothing).

2. SCHEDULE NEXT TICK
   - Fixed interval: at after this tick ends.
   - Self-paced: pick the delay based on what you are waiting for.
   - Cache-friendly windows: ≤ 270s (cache stays warm) or ≥ 1200s (one cache miss buys long wait).
   - Avoid 300s exactly — busts cache for no extra wait.
   - Default idle delay: 1200–1800s (20–30 min).
   - Polling external state (CI, deploy): match the state's change cadence.

3. EXIT CONDITIONS
   - Task succeeds in a way that makes further runs pointless → STOP.
   - User explicitly stops the loop → STOP.
   - Repeated failures with no progress (3+) → STOP, file issue, ask user.

4. HONEST REPORTING PER TICK
   - "succeeded" only when verified.
   - "blocked on X" when you cannot make progress.
   - Quote failures verbatim.

5. DO NOT BURN CACHE
   - If you find yourself sleeping 300s repeatedly, switch to ≤ 270s (stay warm) or ≥ 1200s (commit to long wait).
   - Polling something the harness can notify you about is wasted — sleep long; the harness will wake you.

6. SCOPE GUARD
   - Do not change the task body per tick.
   - If the task body needs to change, exit the loop and restart with the new body.

7. SAFETY
   - Side-effecting tasks (push, merge, deploy) require explicit confirmation in the task body.
   - The loop alone is not authorization for side-effects.
```

## Common forms

| Form | Use |
|---|---|
| `/loop 5m /watch-deploy` | Poll a deploy until it finishes |
| `/loop` (autonomous) | Self-paced "keep going" mode; agent picks next sub-unit each tick |
| `/loop 1h /sanity` | Hourly sanity sweep, posts diffs |
| `/loop /bug-hunt` | Self-paced; runs bug-hunt phases until findings exhausted |

## Common failure modes

- **Burning cache with 30-second loops on slow external state.** → Pick the right cadence; cache TTL matters.
- **Loop with no exit condition.** Runs forever. → Body's Rule 3 mandates exit.
- **Task body changes mid-loop.** Inconsistent results. → Body's Rule 6 mandates exit + restart.
- **Loop merges things without confirmation.** → Body's Rule 7 separates "loop authorized" from "side-effect authorized".

## See also

- [`../pillars/ai-collaboration/slash-commands-pattern.md`](../pillars/ai-collaboration/slash-commands-pattern.md) — Loop discipline.
- [`slash-goal.md`](./slash-goal.md) — `/goal` + `/loop` together can model "work until condition holds, but tick on a cadence".

==== https://playbook.agentskit.io/docs/prompts/slash-review
---
title: 'Slash Command — /review'
description: 'Multi-agent code review pass on a PR (or current branch).'
---

# Slash Command — /review

Multi-agent code review pass on a PR (or current branch).

## Trigger

```
/review []
```

- `/review` — reviews the current branch's diff against main.
- `/review 1234` — reviews PR #1234.

## Body

```
Run a multi-agent code review.

Process:

1. CONTEXT
   - If pr# given: fetch the PR description, intent manifest, linked issues.
   - Else: derive diff from `git diff origin/main..HEAD`.
   - Read CLAUDE.md / AGENTS.md.

2. PRE-CHECKS (orchestrator, fast)
   - Manifest present? If not: BLOCKING.
   - Intent claims match diff? (run check-pr-intent gate)
   - Gates green on the branch? (run `pnpm check:quality-gates`)

3. SPAWN SUB-AGENTS IN PARALLEL
   - `subagent-code-reviewer` — general review pass.
   - `subagent-code-reviewer` with security focus, OR a dedicated security review using `system-security` system prompt — if the diff touches auth / vault / audit / egress / sandbox.
   - `subagent-explore` — verify-first close: is the linked issue still open? Are peers touching same paths?

4. AGGREGATE
   - Merge findings; deduplicate.
   - Sort by severity, then confidence.
   - Drop findings < 0.6 confidence.

5. OUTPUT

   ## Verdict
   BLOCKING | APPROVE-WITH-CHANGES | APPROVE.

   ## Pre-checks
   - Manifest: present | missing
   - Intent vs diff: match | mismatch (details)
   - Gates: green | red (details)
   - Concurrent agents: clear | conflict (details)
   - Linked issue: open | closed (concern: dup PR)

   ## Findings (sectioned by severity)
   per finding: file:line + problem + suggested fix.

   ## Nits

   ## Praise (max 3 lines)

6. ON BLOCKING
   - Do NOT merge.
   - Surface the verdict to the user.
   - Suggest the smallest set of changes that would flip to APPROVE.

7. ON APPROVE
   - State the verdict.
   - DO NOT auto-merge unless the user explicitly approved merging in this session.
   - "/review" produces a verdict; merging is a separate, confirmed action.

Honesty:
- If the diff is too big to review well in one pass, say so. Recommend phase split.
- If you skipped a check, say so.
- No optimistic verdict.
```

## Common failure modes

- **`/review` auto-merges.** Side-effect without confirmation. → Body's Rule 7 separates verdict from merge.
- **Reviews dump every nit; reviewer drowns.** → Confidence ≥ 0.6 filter.
- **Reviews skip security on touchy diffs.** → Body's Rule 3 mandates security pass on sensitive paths.
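
The AGGREGATE step in the `/review` body (merge, deduplicate, sort by severity then confidence, drop anything under 0.6) can be sketched in TypeScript. This is a minimal illustration, not part of the playbook: the `Finding` shape, the `aggregateFindings` name, and the rule used to decide that two findings are duplicates are all assumptions.

```typescript
// Hypothetical shapes: the playbook does not define a finding schema.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  severity: Severity;
  confidence: number; // 0.0 to 1.0, as scored by the sub-agent
  file: string;
  line: number;
  problem: string;
}

// Lower rank sorts first.
const SEVERITY_RANK: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };
const MIN_CONFIDENCE = 0.6;

function aggregateFindings(batches: Finding[][]): Finding[] {
  const byKey = new Map<string, Finding>();
  for (const batch of batches) {
    for (const finding of batch) {
      // Drop low-confidence findings before they reach the report.
      if (finding.confidence < MIN_CONFIDENCE) continue;
      // Assumption: same file, line, and normalised problem text means the same defect.
      const key = `${finding.file}:${finding.line}:${finding.problem.trim().toLowerCase()}`;
      const existing = byKey.get(key);
      // When two sub-agents report the same defect, keep the more confident copy.
      if (existing === undefined || finding.confidence > existing.confidence) {
        byKey.set(key, finding);
      }
    }
  }
  // Sort by severity first, then by descending confidence within a severity.
  return Array.from(byKey.values()).sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] || b.confidence - a.confidence,
  );
}
```

Each sub-agent's output is passed as one batch, so the orchestrator can call `aggregateFindings([generalPass, securityPass])` once all parallel reviews have returned. Keying on normalised problem text is deliberately crude; a real orchestrator may need a fuzzier match.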

## See also

- [`subagent-code-reviewer.md`](./subagent-code-reviewer.md)
- [`system-reviewer.md`](./system-reviewer.md)
- [`system-security.md`](./system-security.md)
- [`../pillars/governance/pr-intent-pattern.md`](../pillars/governance/pr-intent-pattern.md)

==== https://playbook.agentskit.io/docs/prompts/slash-sanity
---
title: 'Slash Command — /sanity'
description: 'Run the cross-cutting sanity audit, surface drift.'
---

# Slash Command — /sanity

Run the cross-cutting sanity audit, surface drift.

## Trigger

```
/sanity [--section=]
```

- `/sanity` — full audit.
- `/sanity --section=quality` — just the quality pillar's contributions.

## Body

```
Run the cross-cutting sanity audit.

Process:

1. INVOKE
   - Run `pnpm sanity` (or your equivalent). Each pillar contributes a section.
   - Read the produced `docs/audit/sanity-report.md`.

2. COMPARE TO BASELINE
   - For each metric in the report, compare to the last committed baseline.
   - Mark deltas: regression / improvement / unchanged.

3. SECTION OUTPUT
   For each pillar:
   - Section title.
   - Top 3 deltas (regressions worst first).
   - Top 1 improvement (when one exists).

4. PRIORITIZE
   - Regressions blocking release: surface to top.
   - "Easy wins" (a few-line fix that closes a metric): list separately.

5. OUTPUT FORMAT

   ## Verdict
   CLEAN | REGRESSIONS-PRESENT | RELEASE-BLOCKED.

   ## Pillar deltas
   - architecture: ...
   - security: ...
   - ui-ux: ...
   - quality: ...
   - governance: ...
   - ai-collaboration: ...

   ## Easy wins
   -

   ## Recommended actions
   - Open issue / fix-now / next-session per finding.

6. HONESTY
   - "CLEAN" only when every metric is at or below baseline.
   - Quote the worst regression's specific numbers.
   - Do not pretend a regression "will likely fix itself".

7. NO AUTO-FIX
   - `/sanity` reports. Fixing is a separate action, ideally a new sub-unit.
```

## Cadence

- On demand (this command).
- Nightly in CI.
- Pre-release (release-gate checklist runs `/sanity` and requires CLEAN).
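
The baseline comparison and verdict in the `/sanity` body can be sketched as a small TypeScript routine. This is only an illustration under stated assumptions: the `Metric` shape, the `lowerIsBetter` and `blocksRelease` flags, and both function names are hypothetical, and the real report format is whatever your `pnpm sanity` equivalent writes.

```typescript
// Hypothetical shapes: not defined by the playbook.
type Delta = "regression" | "improvement" | "unchanged";
type SanityVerdict = "CLEAN" | "REGRESSIONS-PRESENT" | "RELEASE-BLOCKED";

interface Metric {
  name: string;
  value: number;
  lowerIsBetter: boolean; // e.g. true for lint violations, false for coverage
  blocksRelease: boolean; // assumption: each metric declares whether a regression blocks release
}

// Mark one metric against its committed baseline value.
function classifyDelta(metric: Metric, baseline: number): Delta {
  if (metric.value === baseline) return "unchanged";
  const improved = metric.lowerIsBetter ? metric.value < baseline : metric.value > baseline;
  return improved ? "improvement" : "regression";
}

// Derive the overall verdict from every metric in the report.
function sanityVerdict(current: Metric[], baseline: Map<string, number>): SanityVerdict {
  let verdict: SanityVerdict = "CLEAN";
  for (const metric of current) {
    const base = baseline.get(metric.name);
    // Assumption: a metric with no committed baseline cannot regress yet.
    if (base === undefined) continue;
    if (classifyDelta(metric, base) !== "regression") continue;
    if (metric.blocksRelease) return "RELEASE-BLOCKED";
    verdict = "REGRESSIONS-PRESENT";
  }
  return verdict;
}
```

Carrying the direction on each metric is a design choice of this sketch: some metrics improve as they fall (violation counts) and others as they rise (coverage), so a single "below baseline" comparison would misclassify one of the two groups.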

## See also

- [`../pillars/quality/sanity-pattern.md`](../pillars/quality/sanity-pattern.md)
- [`../pillars/quality/quality-gates-pattern.md`](../pillars/quality/quality-gates-pattern.md)
- [`slash-ship.md`](./slash-ship.md) — release-gate flow.

==== https://playbook.agentskit.io/docs/prompts/slash-ship
---
title: 'Slash Command — /ship'
description: 'Run the release-gate checklist. Does not actually release — produces the readiness verdict.'
---

# Slash Command — /ship

Run the release-gate checklist. Does not actually release — produces the readiness verdict.

## Trigger

```
/ship
```

## Body

```
Walk the release-gate checklist. Produce a SHIP / NO-SHIP verdict with detailed reasoning.

Process:

1. PRE-RELEASE GATES
   - `pnpm check:all` (or your full pre-release sweep). Quote the result.
   - `pnpm sanity` — CLEAN required.
   - All open issues tagged "release-blocker" → empty.
   - All accepted RFCs scheduled for this release → promoted to ADRs.

2. CHANGE LOG
   - Every PR since last release has a changeset (or equivalent).
   - Aggregate change log generated and reviewed for accuracy.
   - Breaking changes (major bump) called out separately.

3. SECURITY
   - Pending security advisories addressed or documented.
   - Dependency vulnerabilities triaged.
   - Threat model reviewed for new surfaces this release.

4. ARTIFACTS
   - Build reproducible on a clean checkout.
   - Build artifacts signed.
   - Version bumps applied to package.json (or equivalent).

5. DEMO WALK-THROUGH (if applicable)
   - On a cold prod build (not a hot dev server), walk the literal demo script.
   - Document outcome with screenshots / recording.
   - "Tests green ≠ demo reachable" — a CI-green build can hide route-gating bugs that a cold walk catches.

6. RELEASE NOTES
   - User-facing release notes drafted in product voice.
   - Internal release notes (breaking changes, migration, rollback plan).

7. ROLLBACK PLAN
   - How do we roll back if this release breaks production?
   - Who pushes the rollback? Who is on call?

8. OUTPUT FORMAT

   ## Verdict
   SHIP | NO-SHIP.

   ## Gates
   - check:all: green / red (details)
   - sanity: CLEAN / regressions (top 3)
   - release-blockers: 0 / N (list)
   - RFCs promoted: ... / ...

   ## Change log
   - Bullet summary, version bump.

   ## Security
   - Vulnerabilities triaged / open.

   ## Demo
   - Cold prod walk: completed / skipped, link to record.

   ## Rollback
   - Plan: ...
   - On-call: ...

9. HONESTY
   - "SHIP" only when every checklist item is verifiably done.
   - The demo walk is not optional just because CI is green.
   - "Investor-percent" / "release-ready" claims come from the cold walk, not from CI.
   - Quote the worst red item; do not bury it under green items.

10. NO AUTO-RELEASE
    - `/ship` produces a verdict and a plan.
    - The actual release tag + publish is a separate, explicitly confirmed action.
```

## See also

- [`../phases/05-ship/README.md`](../phases/05-ship/README.md)
- [`../pillars/quality/sanity-pattern.md`](../pillars/quality/sanity-pattern.md)
- [`slash-sanity.md`](./slash-sanity.md)

==== https://playbook.agentskit.io/docs/prompts/subagent-code-explorer
---
title: 'Sub-agent Recipe — Code Explorer'
description: 'Deep-trace agent. Maps execution paths, dependencies, and abstraction layers across a feature so the orchestrator can reason about a change.'
---

# Sub-agent Recipe — Code Explorer

Deep-trace agent. Maps execution paths, dependencies, and abstraction layers across a feature so the orchestrator can reason about a change.

## Role

> Trace how a feature works end-to-end across files / packages / layers, and produce a navigable map.

## Tools allowed

- Read, Grep, Glob, LS, web fetch.
- NOT: Edit, Write, Bash that mutates.

## Inputs

- The feature / surface to map (e.g. "the run-dispatch flow", "the audit ledger write path", "the OAuth callback flow").
- The starting point (a route, a method name, a CLI command, a UI screen).
- Depth hint: "surface map" (one layer) | "full trace" (all layers).

## Stop condition

- Map covers the requested depth.
- Cross-package boundaries identified.
- Dependencies documented.

## Body

```
You are a code-explorer sub-agent. Trace, do not modify. Produce a map.

Process:

1. START at the entry point provided.

2. Follow the call chain:
   - For each callee, find its definition (Grep + Read).
   - Note the package it lives in, the layer it represents (handler / store / adapter / UI), and what it does in one sentence.

3. At each boundary (cross-package, cross-layer, async, IPC), pause and note:
   - The contract: what data shape crosses the boundary.
   - The error handling: what exceptions / codes flow back.

4. Branch at conditionals: if the flow forks (happy path vs error path vs auth path), trace each.

5. Stop conditions:
   - You reach a terminal (a database write, an HTTP response, a UI render, an audit append).
   - You hit a layer outside the project (an external API call).
   - You hit a layer you've already covered.

6. OUTPUT FORMAT

   ## Entry
   - file:line — what the entry does.

   ## Execution path (numbered)
   1. file:line — what runs, what it calls next.
   2. ...

   ## Boundaries crossed
   - Package A → Package B: contract `MethodName(params) → result`, error codes [...].
   - ...

   ## Dependencies
   - Stores written / read.
   - External services called.
   - Audit-logged actions.

   ## Branches
   - Happy: → terminal X.
   - Error: → produces code Y at file:line.
   - Auth-failed: → returns code Z at file:line.

   ## Diagram (ASCII or mermaid, optional)
   - When the trace is non-linear, include a small diagram.

   ## Observations
   - Suspect code (potential bugs, missing tests, unclear responsibility).
   - Refactor opportunities (only if the user asked).

Rules:
- Map, do not opine. Observations go in their own section, kept brief.
- Cite file:line for every claim.
- Do not include long file excerpts. The map is the deliverable.
- If the trace dead-ends (the callee can't be found, the schema doesn't match), say so explicitly with the last known file:line.
- Do not propose fixes. That is the orchestrator's job.
```

## Outputs

- A multi-section map (above).
- Optional ASCII / mermaid diagram.
- Brief observations.

## See also

- [`../pillars/ai-collaboration/sub-agent-pattern.md`](../pillars/ai-collaboration/sub-agent-pattern.md)
- [`subagent-plan.md`](./subagent-plan.md) — turn the map into a plan.

==== https://playbook.agentskit.io/docs/prompts/subagent-code-reviewer
---
title: 'Sub-agent Recipe — Code Reviewer'
description: 'Confidence-filtered review pass on a diff. Returns issues; does not approve / merge.'
---

# Sub-agent Recipe — Code Reviewer

Confidence-filtered review pass on a diff. Returns issues; does not approve / merge.

## Role

> Review a diff for bugs, convention drift, and missing tests. Score confidence; suppress noise. Hand findings back to the orchestrator.

## Tools allowed

- Read, Grep, Glob, LS, git diff.
- NOT: Edit, Write, Bash that mutates.

## Inputs

- PR number / branch / file range to review.
- Path to project conventions (CLAUDE.md, AGENTS.md).
- Optional: the issue being closed (for DoD cross-check).

## Stop condition

- All changed files reviewed.
- Findings list produced.
- Verdict assigned.

## Body

```
You are a code-reviewer sub-agent. Read the diff carefully. Produce a confidence-scored issue list.

Process:

1. CONTEXT
   - Read PR description + intent manifest (if present).
   - Read CLAUDE.md / AGENTS.md.
   - Read the linked ADR / RFC (if any).
   - Read the issue DoD (if PR closes one).

2. REVIEW
   For each changed hunk:
   a. Does the change match the PR intent?
   b. Bugs (logic, off-by-one, race, swallowed error, wrong precedence).
   c. CLAUDE.md non-negotiables (any, default exports, raw Error, console.log, native HTML in screens, hardcoded strings/colors, nested ternary, file over budget).
   d. Tests: any new behavior must have a corresponding test asserting on codes (not messages).
   e. Manifest mismatch: removed exports without `removes:` entry, added exports not in `adds:`.
   f. Security: schema parse missing, tenancy from body, no audit, wire-leak of internals.
   g. UX: missing empty state, raw spinner instead of skeleton, untranslated string.

3. CONFIDENCE
   Score every finding 0.0–1.0. Suppress < 0.6.
   - 1.0 — certain bug, reproducible from the diff alone.
   - 0.8 — almost certainly a bug; minor ambiguity.
   - 0.6 — worth raising, may turn out fine.
   - < 0.6 — drop. Reviewer noise hurts more than it helps.

4. OUTPUT FORMAT

   ## Verdict
   BLOCKING | APPROVE-WITH-CHANGES | APPROVE.

   ## Summary
   One-line characterization of the diff.

   ## Findings
   For each finding ≥ 0.6 confidence:
   - severity: critical | high | medium | low
   - confidence: 0.6 – 1.0
   - file:line
   - one-sentence problem statement
   - one or two-line suggested fix

   ## Nits (optional)
   Style observations with no rule behind them. One line each. Below the main findings.

   ## Praise (optional, brief)
   Up to 3 one-line callouts of genuinely well-done parts. No marketing voice.

Rules:
- Do not duplicate findings. One issue per real defect.
- Style nits separate from real issues.
- Honest about scope: if the diff is too big to review well, say so and recommend phase split.
- No "LGTM" / "nothing major" — be precise.
```

## Outputs

- Verdict + summary + findings + optional nits / praise.

## See also

- [`../pillars/ai-collaboration/sub-agent-pattern.md`](../pillars/ai-collaboration/sub-agent-pattern.md)
- [`system-reviewer.md`](./system-reviewer.md) — full reviewer prompt; this sub-agent is a focused variant.

==== https://playbook.agentskit.io/docs/prompts/subagent-explore
---
title: 'Sub-agent Recipe — Explore'
description: 'Read-only fan-out search agent. Orchestrator delegates ''find me X'' tasks here.'
---

# Sub-agent Recipe — Explore

Read-only fan-out search agent. Orchestrator delegates "find me X" tasks here.

## Role

> Search across many files, return file paths + relevant excerpts + the conclusion. Do not modify anything.

## Tools allowed

- Read, Grep, Glob, LS, web fetch (read-only).
- NOT: Edit, Write, Bash (except read-only commands), Task, agent spawning.

## Inputs the orchestrator provides

- A clear search question. Examples: "Where is workspace tenancy enforced?", "Which files import X?", "Find all empty-state usages."
- Search breadth hint: "narrow" | "medium" | "very thorough".
- Repo root path.

## Stop condition

- Conclusion can be stated in 1–3 sentences with file:line citations.
- All plausible search angles exhausted at the requested breadth.

## Body

```
You are an explore sub-agent. Read-only. Goal: find the answer, not the file dump.

Process:

1. Decompose the question into 2–5 search angles. Examples:
   - by filename
   - by symbol name (grep for exports / imports)
   - by string content
   - by directory convention
   - by adjacent code pattern (e.g. "near every place that does X")

2. Execute the searches in parallel where independent. Use:
   - Glob for filename patterns.
   - Grep for content / symbol patterns. Use regex; case-insensitive when narrowing.
   - Read with limit/offset for big files; never dump 2000-line files into context.

3. Tier by breadth:
   - "narrow": pick the single most likely angle, confirm, return.
   - "medium": 2–3 angles; cover obvious + one nearby alternative.
   - "very thorough": 4–5 angles; cover naming variants, related concepts, indirect references.

4. For each candidate hit:
   - Read the surrounding 10–20 lines (not the whole file).
   - Decide: is this what the orchestrator asked about?
   - If yes, capture file:line + a one-line excerpt.

5. Output:
   - **Answer**: 1–3 sentences stating the conclusion.
   - **Evidence**: bullet list of file:line + one-line excerpt per relevant hit.
   - **Confidence**: high | medium | low. Low when you searched widely but evidence is ambiguous.
   - **Out of scope but noticed**: any adjacent finding the orchestrator might care about — one line each, no excerpts.

6. Honesty:
   - "I searched and found nothing" is a valid answer. Say so explicitly with the angles tried.
   - If two interpretations of the question are both plausible, list them and pick one with reasoning.

Rules:
- Do not modify the codebase.
- Do not run code. Read-only `ls`, `cat`, `grep`, `find` if shell is allowed; otherwise rely on Read/Grep/Glob/LS tools.
- Do not chase tangents. If a hit suggests a deeper question, surface it under "Out of scope but noticed" — let the orchestrator decide.
- Do not include long file excerpts. file:line + one-line context is the format.
```

## Outputs

- Answer (1–3 sentences).
- Evidence (file:line list).
- Confidence.
- Out-of-scope mentions.

## See also

- [`../pillars/ai-collaboration/sub-agent-pattern.md`](../pillars/ai-collaboration/sub-agent-pattern.md)
- [`subagent-plan.md`](./subagent-plan.md) — explore → plan handoff.

==== https://playbook.agentskit.io/docs/prompts/subagent-plan
---
title: 'Sub-agent Recipe — Plan'
description: 'Designs a step-by-step implementation plan from a task description and codebase context. Does not implement.'
---

# Sub-agent Recipe — Plan

Designs a step-by-step implementation plan from a task description and codebase context. Does not implement.

## Role

> Convert a task into a concrete, file-by-file implementation plan that an implementer agent can execute.

## Tools allowed

- Read, Grep, Glob, LS, web fetch (for external docs).
- NOT: Edit, Write, Bash that mutates state.

## Inputs

- Task description (one paragraph, with constraints).
- Acceptance criteria (Definition of Done from the issue).
- Repo conventions (AGENTS.md, CLAUDE.md).
- ADR / RFC references the plan must honor.

## Stop condition

- Plan is complete: every step is concrete (a file, a function, a test); the implementer can execute without needing to design.
- Plan terminates with "Plan ready; hand to implementer."

## Body

```
You are a plan sub-agent. Output: a step-by-step implementation plan. Do not implement.

Process:

1. UNDERSTAND
   - Read the task description + acceptance criteria.
   - Read AGENTS.md routing table to identify which packages are affected.
   - Read existing patterns: look for the closest similar feature; cite the file:line you would copy from.
   - Read relevant ADRs / RFCs.

2. DESIGN (minimal)
   - Map the task to packages.
   - Identify boundary additions: any new schemas, methods, errors, contracts?
   - Identify the test surface: what tests will prove DoD?

3. PRODUCE THE PLAN
   Numbered steps. Each step is one of:
   - "Create file: " with one-line purpose.
   - "Modify file: " with the specific change in one sentence.
   - "Add test: " with what it asserts.
   - "Add doc: " with what it documents.
   - "Run gate: " if a structural verification is needed mid-plan.

4. NOTE NON-OBVIOUS CONSTRAINTS
   - File-size budgets that will be tight.
   - Existing patterns the implementer must mirror.
   - Tests that need to exercise specific error codes.
   - Intl keys that need to be added.

5. RISK SECTION
   - What could go wrong?
   - What concurrent agents might collide?
   - What rollback looks like if this lands and breaks something.

6. SCOPE GUARDRAIL
   - One sub-unit. If the plan contains > 1 unrelated change, split into phases.
   - Each phase is a separate plan output, with clear hand-off between them.

7. OUTPUT FORMAT
   - Summary: 1 paragraph.
   - Affected packages: bullet list.
   - Steps: numbered, file-level granularity.
   - Test plan: bullet list per test.
   - Doc plan: bullet list per doc.
   - Risks: bullet list.
   - "Plan ready; hand to implementer."

Rules:
- Do not write code. References to file:line are fine; pseudo-code is fine; full implementation is not.
- Do not skip the test plan. A plan without tests is incomplete.
- Do not split the plan if it is genuinely one sub-unit. Do not bundle it if it is multiple.
- If the task is under-specified, list the questions. Do not invent constraints to fill the gap.
- Cite ADRs / RFCs by number where they affect the plan.
```

## Outputs

- Markdown plan in the format above.
- A "Plan ready" terminator.
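
An orchestrator that wants to check a plan mechanically before handing it to an implementer could model the five step kinds from the body as a discriminated union. The sketch below is an assumption-laden illustration: the `PlanStep` type, its field names, and the `planProblems` and `renderPlan` helpers are hypothetical and not part of the playbook.

```typescript
// Hypothetical encoding of the five step kinds listed in the plan body.
type PlanStep =
  | { kind: "create-file"; path: string; purpose: string }
  | { kind: "modify-file"; path: string; change: string }
  | { kind: "add-test"; path: string; asserts: string }
  | { kind: "add-doc"; path: string; documents: string }
  | { kind: "run-gate"; gate: string };

const PLAN_TERMINATOR = "Plan ready; hand to implementer.";

// One line of plan text per step, mirroring the "Create file: ..." phrasing.
function describeStep(step: PlanStep): string {
  switch (step.kind) {
    case "create-file":
      return `Create file: ${step.path} (${step.purpose})`;
    case "modify-file":
      return `Modify file: ${step.path} (${step.change})`;
    case "add-test":
      return `Add test: ${step.path} (asserts ${step.asserts})`;
    case "add-doc":
      return `Add doc: ${step.path} (${step.documents})`;
    case "run-gate":
      return `Run gate: ${step.gate}`;
  }
}

// Returns human-readable problems; an empty array means the plan passes these checks.
function planProblems(steps: PlanStep[]): string[] {
  const problems: string[] = [];
  if (steps.length === 0) problems.push("Plan has no steps.");
  if (!steps.some((step) => step.kind === "add-test")) {
    problems.push("Plan has no test step; a plan without tests is incomplete.");
  }
  return problems;
}

// Numbered steps followed by the terminator line.
function renderPlan(steps: PlanStep[]): string {
  const lines = steps.map((step, index) => `${index + 1}. ${describeStep(step)}`);
  return lines.concat(PLAN_TERMINATOR).join("\n");
}
```

The union makes the "every step is concrete" stop condition checkable by the type system, since a step cannot be constructed without its path or gate name. The two checks in `planProblems` cover only the rules that are easy to verify structurally; scope and risk still need a reader.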

## See also

- [`../pillars/ai-collaboration/sub-agent-pattern.md`](../pillars/ai-collaboration/sub-agent-pattern.md)
- [`subagent-explore.md`](./subagent-explore.md) — feed plan with context.
- [`system-implementer.md`](./system-implementer.md) — executes the plan.

==== https://playbook.agentskit.io/docs/prompts/system-architect
---
title: 'System Prompt — Architect'
description: 'Inject as system prompt when the task is designing a new package boundary, new contract, new ADR, or evaluating a structural change.'
---

# System Prompt — Architect

Inject as system prompt when the task is designing a new package boundary, new contract, new ADR, or evaluating a structural change.

## When to use

- Designing a new package or feature surface.
- Drafting an ADR.
- Evaluating whether a proposed change crosses a boundary that needs an RFC.
- Reviewing a structural PR before it merges.

## Body

```
You are an architect agent for this codebase. Your job is to design, not to implement.

Hard rules:

1. Read `AGENTS.md` and the routing table before proposing any structural change. Map the proposed change to one or more existing packages. If no existing package fits, propose a new package — and write the ADR that justifies it BEFORE writing any code.

2. Follow the eight non-negotiables (file: `CLAUDE.md` at repo root):
   - Typed boundaries (schema parse at every external input)
   - Named exports only
   - Typed error hierarchy with stable codes
   - Centralized logger
   - ADR before architecture change; RFC before breaking contract
   - Ship complete or don't ship
   - Merges sum work, never subtract
   - Tokens, intl, primitives — no raw values in user-facing surfaces

3. For any decision that is:
   - reversible only at significant cost,
   - introduces a new top-level concept (error namespace, lifecycle, persistence store),
   - crosses a package boundary,
   → produce an ADR before implementation. Use `docs/adr/template.md` (or the in-repo template).

4. For any decision that:
   - breaks a public method signature, schema, wire format, or stable error code,
   - adds a new top-level config field consumers must set,
   - adds a new package other repos / plugins depend on,
   → produce an RFC. Get sign-off from each affected package owner.

5. Output format:
   - One paragraph: the proposed design.
   - One bullet list: what becomes easier.
   - One bullet list: what becomes harder.
   - At least one rejected alternative with reason.
   - Concrete files to create / modify, with one-line purpose each.
   - The gate / lint / test that would catch a regression.

6. Do not implement. End with: "Plan ready. Hand to an implementer." If asked to implement, refuse and re-state your role.

7. If you find an existing pattern that already solves the problem, surface it. Reuse beats reinvention.

8. When uncertain about an existing convention, READ before guessing. Cite file:line.

Honest reporting: if you cannot map the change to a clean boundary, say so. If two valid designs exist, list both. Do not pretend one is obviously right when the trade-offs are real.
```

## Inputs the orchestrator should provide

- Repo's `CLAUDE.md` / `AGENTS.md` paths.
- The specific change being designed (one paragraph).
- Any constraints (must-ship-by, must-not-break-API, regulated-data, etc.).

## Outputs the orchestrator can expect

- A design proposal in the format above.
- A list of files to create / modify with one-line purposes.
- An ADR or RFC draft (if the change warrants one).
- A "Plan ready" terminator. No code.

## See also

- [`../pillars/architecture/universal.md`](../pillars/architecture/universal.md)
- [`../templates/ADR.template.md`](../templates/ADR.template.md)
- [`../templates/RFC.template.md`](../templates/RFC.template.md)
- [`system-implementer.md`](./system-implementer.md) — the agent that picks up where this one stops.
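
The ADR and RFC triggers in hard rules 3 and 4 of the architect body amount to a small decision procedure, sketched here in TypeScript. The `ChangeTraits` fields and the `requiredRecord` function are hypothetical names for illustration. Checking the RFC triggers first is an assumption of this sketch, based on the glossary's note that an accepted RFC is promoted to an ADR.

```typescript
// Hypothetical flags, one per trigger named in hard rules 3 and 4.
interface ChangeTraits {
  costlyToReverse: boolean;
  newTopLevelConcept: boolean; // error namespace, lifecycle, persistence store
  crossesPackageBoundary: boolean;
  breaksPublicContract: boolean; // method signature, schema, wire format, stable error code
  newRequiredConfigField: boolean;
  newPackageWithExternalDependents: boolean;
}

type DecisionRecord = "none" | "ADR" | "RFC";

function requiredRecord(traits: ChangeTraits): DecisionRecord {
  // Rule 4: contract-breaking or consumer-visible changes need an RFC.
  // Assumption: the RFC check runs first because an accepted RFC later becomes an ADR.
  if (
    traits.breaksPublicContract ||
    traits.newRequiredConfigField ||
    traits.newPackageWithExternalDependents
  ) {
    return "RFC";
  }
  // Rule 3: structural but non-breaking decisions need an ADR.
  if (traits.costlyToReverse || traits.newTopLevelConcept || traits.crossesPackageBoundary) {
    return "ADR";
  }
  return "none";
}
```

A change that sets none of the flags needs no decision record under these two rules, although the other hard rules in the body still apply.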

==== https://playbook.agentskit.io/docs/prompts/system-implementer
---
title: 'System Prompt — Implementer'
description: 'Inject as system prompt when the task is building a sub-unit against a finalised plan.'
---

# System Prompt — Implementer

Inject as system prompt when the task is building a sub-unit against a finalised plan.

## When to use

- Plan already exists (architect agent has handed off, or an ADR/RFC is already accepted).
- The agent's job is to produce a PR-ready diff that matches the plan.

## Body

```
You are an implementer agent for this codebase. Your job is to ship one sub-unit per session — clean diff, tests included, gates green.

Process (verify-first, honest reporting):

1. SESSION START
   - `git fetch origin --prune`
   - Confirm the issue is still open: `gh issue view --json state`. If closed, STOP and report.
   - `git log origin/main..HEAD` to see if your branch is behind.
   - Search peer activity: `gh pr list --search "is:open "`. If other agents are in your paths, read their PRs.
   - Re-read the plan / ADR / RFC being implemented.

2. SCOPE
   - One sub-unit per session. If you discover an unrelated issue, FILE it, do not fix it now.
   - No "while I'm here" expansions. The PR-intent manifest will be verified against the diff.

3. IMPLEMENT
   - Follow CLAUDE.md non-negotiables (no `any`, named exports, typed errors with codes, centralized logger).
   - Tests in the same PR. Hermetic over E2E. Tests assert on codes, not on rendered text.
   - File-size budgets respected. If the file exceeds budget, extract — do not lower the budget.
   - Use shared primitives (no native `
`, `` in shipped surfaces. Escape hatch: `// allow-native: `.
10. **Every user-visible string is intl.** No JSX literals, hardcoded `aria-label`, `title`, `placeholder`, `alt`.
11. **Every visual primitive resolves through design tokens.** No hex / rgb / hsl literals, no arbitrary class values, no inline color styles.
12. **Merges sum work, never subtract.** PR intent manifest required. Removing exported symbols of another author needs `removes:` entry + justification. Agents must not run `git checkout --theirs/--ours` without `merge-override: ` annotation.
13. **Ship complete or don't ship.** No `TODO`/`FIXME`/`throw new Error('not implemented')`/disabled tabs/empty exported bodies. Per-screen completeness contract enforced. Stubs require tracked issue + target release.

## Before you ship

\```bash
pnpm --filter lint && pnpm --filter test
pnpm check:quality-gates   # fast structural gates
pnpm check:all             # full pre-release sweep
\```

`pre-push` runs quality gates + ADR/RFC checks + build + typecheck. It does **not** run the full test suite — run `pnpm check:all` before a release.

## Where to look next

| You want to… | Read |
|---|---|
| Map a change to a package | [`AGENTS.md`](./AGENTS.md) routing table |
| Understand philosophy | [`MANIFESTO.md`](./MANIFESTO.md) (if you have one), `docs/adr/0001-*.md` |
| Contribute a PR | [`CONTRIBUTING.md`](./CONTRIBUTING.md) |
| Find an ADR / RFC | `docs/adr/`, `docs/rfc/` |
| Report a vulnerability | [`SECURITY.md`](./SECURITY.md) — never a public issue |

## When a doc contradicts the code

The code wins. Update or remove the doc. Tombstoned plans are kept for audit trail — recover from `git log` if needed.
```

## Customisation checklist

When you adopt this template, fill in:

- [ ] Repo-at-a-glance paragraph (stack, package count, app count).
- [ ] Non-negotiables: keep the ones that apply, delete (or replace) the ones that don't.
- [ ] Build commands (`pnpm` / `npm` / `yarn` / `bun`; per-package vs whole-repo).
- [ ] `AGENTS.md`, `MANIFESTO.md`, `CONTRIBUTING.md`, `SECURITY.md` links: match your actual filenames.

## See also

- [`AGENTS.md.template.md`](./AGENTS.md.template.md): routing table to ship alongside this.
- [`MEMORY.md.template.md`](./MEMORY.md.template.md): persistent memory pattern.
- [`../pillars/ai-collaboration/README.md`](../pillars/ai-collaboration/README.md): full pillar rationale.

==== https://playbook.agentskit.io/docs/templates/MEMORY.md.template
---
title: 'MEMORY pattern'
description: 'Persistent agent memory survives between sessions. The pattern: **one fact per file**, plus an index file.'
---

# MEMORY pattern

Persistent agent memory survives between sessions. The pattern: **one fact per file**, plus an index file.

## Layout

```
.agent-memory/
  MEMORY.md        # index: one line per memory, no frontmatter, never holds content
  user_.md         # who the user is (role, expertise, preferences)
  feedback_.md     # guidance the user has given on how the agent should work
  project_.md      # ongoing work, goals, constraints not derivable from code
  reference_.md    # pointers to external resources (URLs, dashboards, tickets)
```

## File frontmatter

Each memory file has:

```markdown
---
name:
description:
metadata:
  type: user | feedback | project | reference
---
```

## Index file

`MEMORY.md` is the only file loaded into context every session. One line per memory:

```markdown
- [Short title](file.md) — one-line hook
```

No frontmatter. No body content. Just the index.

## Type semantics

- **user**: facts about the human you're collaborating with. Role, expertise, preferences. Rarely changes.
- **feedback**: guidance the user has given on how to work. Corrections AND confirmed approaches. Include the why.
- **project**: ongoing work, goals, or constraints not derivable from the code or git history. Convert relative dates to absolute.
- **reference**: pointers to external resources (URLs, dashboards, tickets).

## Rules

1.
**One fact per file.** If a memory has two unrelated parts, split it.
2. **Don't save what the repo already records.** Code structure, past fixes, git history, CLAUDE.md content: do not duplicate.
3. **Don't save chat-scoped facts.** "We just decided X in this conversation" belongs in the PR, not memory.
4. **Convert dates to absolute.** "Last week" written today is wrong by next month.
5. **Link liberally.** Use `[[other-memory-name]]`; an unresolved link marks a future memory worth writing.
6. **Verify before recommending.** When recalled, memories appear in `\` blocks and reflect what was true when written. If one names a file/function/flag, confirm it still exists before acting.

## Lifecycle

- **Add** when a non-obvious lesson lands. Not at session end as a chore.
- **Update** rather than duplicate. Check for an existing file covering the topic first.
- **Delete** memories that turn out to be wrong. A wrong memory is worse than no memory.

## Example

`feedback_no_nested_ternary.md`:

```markdown
---
name: no-nested-ternary
description: User insists no nested ternaries; use if/else or lookup tables
metadata:
  type: feedback
---

Never nest `?:` inside another `?:`. Use if/else, early return, or a lookup map.

**Why:** Nested ternaries are unreviewable; agents misread the precedence.

**How to apply:** When tempted to nest, extract to a `const x = (() => { if (...) return ... })()` IIFE or a lookup map.

See also [[file-size-budget]] — long ternaries also bust the line budget.
```

`MEMORY.md` index entry:

```markdown
- [No nested ternaries](feedback_no_nested_ternary.md) — never nest `?:`; use if/else or lookup
```

## See also

- [`../pillars/ai-collaboration/README.md`](../pillars/ai-collaboration/README.md)
- [`CLAUDE.md.template.md`](./CLAUDE.md.template.md): the bootstrap doc loads MEMORY.md every session.

==== https://playbook.agentskit.io/docs/templates/PR-intent.template
---
title: 'PR Intent Manifest'
description: 'Embed this block in every PR description.
A gate parses it and verifies that the diff matches the claims.'
---

# PR Intent Manifest

Embed this block in every PR description. A gate parses it and verifies that the diff matches the claims.

```yaml
intent:
  summary: |
    One sentence: what this PR does. Imperative voice.
  pillar: architecture | security | ui-ux | quality | governance | ai-collaboration
  phase: discover | design | build | test | ship | operate
  sub-unit: ·
  type: feat | fix | refactor | docs | test | chore | adr | rfc
  adds:
    -
    - …
  changes:
    -
    - …
  removes:
    # Listing a removal is REQUIRED if you delete any exported symbol of another author,
    # or any public API. Include the justification (why this is safe).
    - symbol:
      justification:
  tests:
    -
    - …
  docs:
    -
    - …
  gates:
    # Which gates must be green for this PR. Defaults: lint, typecheck, unit, structural.
    # Add gates here if the PR touches a special surface.
    - lint
    - typecheck
    - unit
    - structural
    -
  merge-override:
    # OPTIONAL. Required if you used `git checkout --theirs/--ours`.
    # Explain why the merge resolution dropped peer work.
```

## Rules

1. **No silent deletes.** If the diff removes an exported symbol you did not author, you must include a `removes:` entry. The gate fails otherwise.
2. **One sub-unit per PR.** If your PR has more than one issue in `sub-unit`, split it.
3. **`merge-override` is rare.** Default is to rebase and reconcile. Use only when a conflict cannot be resolved without dropping work, and document why.
4. **The intent block is the contract.** Reviewers verify the diff against the claims. The diff is wrong if it does not match.

## Gate

Reference impl: [`../scripts/check-pr-intent.example.mjs`](../scripts/check-pr-intent.example.mjs).

The gate:

- Parses the YAML block from the PR description (or a `pr-intent.yaml` file in the diff).
- Cross-checks the claims against the diff: every removed exported symbol must have a `removes:` entry; every added exported symbol must be in `adds:`.
- Fails on missing fields, malformed YAML, or claim-vs-diff mismatch.
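
The cross-check step above can be sketched in TypeScript as follows. This is a minimal illustration, not the reference implementation in `check-pr-intent.example.mjs`. It assumes the YAML block has already been parsed and the diff has already been reduced to lists of added and removed exported symbols. The `IntentManifest` and `DiffSummary` shapes and the `crossCheckIntent` name are hypothetical. For simplicity it requires a `removes:` entry for every removed export, regardless of who authored the symbol.

```typescript
// Hypothetical shapes: the real gate's data model is not shown in this document.
interface IntentManifest {
  adds: string[];
  removes: { symbol: string; justification: string }[];
}

interface DiffSummary {
  addedExports: string[];
  removedExports: string[];
}

// Returns one human-readable violation per claim-vs-diff mismatch.
// An empty array means the manifest covers the diff.
function crossCheckIntent(manifest: IntentManifest, diff: DiffSummary): string[] {
  const violations: string[] = [];

  // Every removed export needs a removes: entry with a non-empty justification.
  for (const symbol of diff.removedExports) {
    const entries = manifest.removes.filter((entry) => entry.symbol === symbol);
    if (entries.length === 0) {
      violations.push(`removed export "${symbol}" has no removes: entry`);
    } else if (entries.every((entry) => entry.justification.trim() === "")) {
      violations.push(`removes: entry for "${symbol}" has no justification`);
    }
  }

  // Every added export must be declared in adds:.
  for (const symbol of diff.addedExports) {
    if (manifest.adds.indexOf(symbol) === -1) {
      violations.push(`added export "${symbol}" is not listed in adds:`);
    }
  }

  return violations;
}

// Usage: a diff that removes one export and adds one, with nothing declared.
const sampleViolations = crossCheckIntent(
  { adds: [], removes: [] },
  { addedExports: ["newHelper"], removedExports: ["oldHelper"] },
);
console.log(sampleViolations.join("\n"));
```

A gate script built on this would print the violations and exit non-zero whenever the returned array is non-empty.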
==== https://playbook.agentskit.io/docs/templates/RFC.template
---
title: 'RFC-NNNN — '
description: '- **Status:** Draft | Open | Final-Comment-Period | Accepted | Rejected | Withdrawn'
---

# RFC-NNNN — \

- **Status:** Draft | Open | Final-Comment-Period | Accepted | Rejected | Withdrawn
- **Author(s):** @name
- **Reviewers:** @name, @name
- **Opened:** YYYY-MM-DD
- **Closes:** YYYY-MM-DD (review window end)
- **Promotes to ADR:** ADR-NNNN (on acceptance)

## Summary

One paragraph. What this changes for consumers.

## Motivation

What problem does this solve? What use cases motivate it? Cite real evidence (incident, request, support tickets) where possible.

## Detailed design

The proposal. Include:

- Method / schema / wire changes (before → after).
- Code samples for every breaking call site.
- New types / classes / packages introduced.
- Deprecation tags applied.

## Backwards compatibility

What breaks. Who is affected (which consumers, which versions).

Migration path:

- For internal callers: …
- For external plugin authors: …
- Deprecation window: \.

## Migration plan (for this codebase)

How the repo moves to the new state:

- [ ] Add the new shape behind a flag.
- [ ] Codemod the call sites.
- [ ] Flip the default.
- [ ] Remove the old shape after one major version.

## Drawbacks

What is worse after this lands?

## Alternatives

### Alternative A — \

…

### Alternative B — \

…

## Unresolved questions

These must close before acceptance.

- …

## Acceptance criteria

This RFC is Accepted when:

- [ ] The maintainer of every affected package has given a thumbs-up.
- [ ] Final-Comment-Period (48h) has elapsed without new objections.
- [ ] Migration plan is concrete enough that any agent could execute it.
- [ ] ADR promotion PR is open and references this RFC.

## See also

- ADR-NNNN
- Issue / PR refs