How to mix test types so the cheap ones catch most bugs and the expensive ones cover what only they can.

Test Pyramid

TL;DR (human)

Five tiers, ordered by cost. Spend most of your budget on tiers 1–2. Reserve tier 5 for golden paths and cross-process boundaries. The pyramid is not dogma — it is cost optimization for catch rate.

For agents

Tiers

Tier	Type	Runtime	What it catches
1	Schema parse / contract	<1 ms each	Wrong shapes, missing fields, type-vs-runtime drift
2	Unit (pure functions, single class)	<10 ms each	Logic errors, off-by-one, edge cases
3	Integration (handler + store + adapter, in-process)	<500 ms each	Boundary mismatches, transaction-ordering, missing wiring
4	Visual regression / a11y	seconds each	Token drift, component layout regressions, a11y violations
5	E2E (real app, real services)	minutes each	Cross-process bugs, real-world flow integrity
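As an illustration of tier 1, the sketch below hand-rolls a parse step for a users.list request. The ListUsersInput shape, the limit field with its default of 50, and the parseListUsersInput name are assumptions made for this example; a real project would more likely use its schema library for the same job.

```typescript
// Hypothetical request shape, not taken from the original codebase.
type ListUsersInput = { workspaceId: string; limit: number };

type ParseResult =
  | { ok: true; value: ListUsersInput }
  | { ok: false; code: "VALIDATION_ERROR"; field: string };

// Hand-rolled parser standing in for a schema library.
function parseListUsersInput(raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, code: "VALIDATION_ERROR", field: "(root)" };
  }
  const rec = raw as Record<string, unknown>;
  const workspaceId = rec["workspaceId"];
  if (typeof workspaceId !== "string" || workspaceId.length === 0) {
    return { ok: false, code: "VALIDATION_ERROR", field: "workspaceId" };
  }
  // Assumed default of 50 when the caller omits limit.
  const limit = rec["limit"] === undefined ? 50 : rec["limit"];
  if (typeof limit !== "number" || Math.floor(limit) !== limit || limit < 1) {
    return { ok: false, code: "VALIDATION_ERROR", field: "limit" };
  }
  return { ok: true, value: { workspaceId, limit } };
}

// The two cases a tier 1 test would pin: a missing field and a valid minimal input.
const missingWorkspace = parseListUsersInput({ limit: 10 });
const minimalValid = parseListUsersInput({ workspaceId: "ws_1" });
```

A test at this tier asserts on the result's ok flag, code, and field, with no handler, store, or network involved.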

Where to spend

Rough budget for a healthy codebase:

70% of test count: tier 1–2.
25%: tier 3.
4%: tier 4.
1%: tier 5.

If your suite inverts this — 60% E2E, 10% unit — your runtime is long, your signal is flaky, and your debugging surface is huge.
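The budget can be checked mechanically once tests are tagged by tier. The following is a minimal sketch; the TierCounts shape and the isInverted heuristic (E2E tests outnumbering tiers 1 and 2 combined) are assumptions for illustration, not a rule stated by this guide.

```typescript
// Test counts keyed by tier number (1 to 5).
type TierCounts = { 1: number; 2: number; 3: number; 4: number; 5: number };

type Shares = { low: number; integration: number; visual: number; e2e: number };

// Share of total test count, in percent, for each budget bucket.
function tierShares(c: TierCounts): Shares {
  const total = c[1] + c[2] + c[3] + c[4] + c[5];
  if (total === 0) return { low: 0, integration: 0, visual: 0, e2e: 0 };
  const pct = (n: number): number => (n * 100) / total;
  return {
    low: pct(c[1] + c[2]),
    integration: pct(c[3]),
    visual: pct(c[4]),
    e2e: pct(c[5]),
  };
}

// Assumed heuristic: the pyramid is inverted when tier 5 outnumbers tiers 1 and 2 together.
function isInverted(c: TierCounts): boolean {
  return c[5] > c[1] + c[2];
}

// A suite matching the rough budget, and one matching the inverted example (60% E2E, 10% unit).
const healthySuite: TierCounts = { 1: 300, 2: 400, 3: 250, 4: 40, 5: 10 };
const invertedSuite: TierCounts = { 1: 0, 2: 100, 3: 200, 4: 100, 5: 600 };
```

Run against real counts in CI, a check like this turns the budget from a guideline into a trend you can watch.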

Which tier catches which bug

When a bug is reported, pick the lowest tier that can pin it:

Could a schema parse test reject the bad input? → Add tier 1 test.
Could a unit test fail on the wrong logic? → Add tier 2 test.
Does the bug appear only when handler + store interact? → Tier 3.
Does the bug show only in the rendered DOM? → Tier 4.
Does the bug live in cross-process handoff or browser-only behavior? → Tier 5.

Escalate to a higher tier only when the lower tier cannot pin the bug.
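The questions above amount to a first-match walk from cheapest to most expensive. A small sketch of that walk follows; the BugTraits field names and the lowestTier function are hypothetical, chosen only to mirror the list.

```typescript
type Tier = 1 | 2 | 3 | 4 | 5;

// Answers to the triage questions for one reported bug (hypothetical field names).
type BugTraits = {
  schemaCouldRejectInput: boolean;
  unitTestCouldFail: boolean;
  needsHandlerAndStore: boolean;
  onlyInRenderedDom: boolean;
};

// Walk the questions in cost order and stop at the first "yes".
function lowestTier(bug: BugTraits): Tier {
  if (bug.schemaCouldRejectInput) return 1;
  if (bug.unitTestCouldFail) return 2;
  if (bug.needsHandlerAndStore) return 3;
  if (bug.onlyInRenderedDom) return 4;
  return 5; // cross-process handoff or browser-only behavior
}

// A logic bug that also happens to be visible in the DOM still belongs in tier 2.
const visibleLogicBug = lowestTier({
  schemaCouldRejectInput: false,
  unitTestCouldFail: true,
  needsHandlerAndStore: false,
  onlyInRenderedDom: true,
});
```

The ordering is the point: a bug that several tiers could catch gets its regression test in the cheapest one.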

Test names

A test name reads like a sentence:

describe("users.list handler", () => {
  it("rejects missing workspaceId with VALIDATION_ERROR", ...)
  it("returns empty rows when no users in workspace", ...)
  it("respects limit and cursor for pagination", ...)
})

Anti-pattern: it("works"), it("test 1"). Agent-produced tests with names like these are a smell: a test that cannot say what it checks has usually tested the wrong thing.

Determinism

Every test runs in isolation, in any order, in parallel, with no shared state.

No file system writes outside a per-test temp dir.
No network calls (mock the boundary).
No timer / clock drift (inject the clock).
No global module state.

If a test passes in isolation and fails in parallel, the test has hidden global state. Fix the test, not the order.
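"Inject the clock" can be sketched as follows. The Clock interface, the Session type, and makeFakeClock are names assumed for this example; the idea is only that code under test asks a passed-in clock for the time instead of reading the system clock directly.

```typescript
// Minimal clock interface; production code would pass { now: () => Date.now() }.
type Clock = { now: () => number };

// Hypothetical domain type for the example.
type Session = { expiresAtMs: number };

// The function under test never reads the system clock itself.
function isExpired(session: Session, clock: Clock): boolean {
  return clock.now() >= session.expiresAtMs;
}

// Test double: a clock the test fully controls, so no real time has to pass.
function makeFakeClock(startMs: number): Clock & { advance: (ms: number) => void } {
  let current = startMs;
  return {
    now: () => current,
    advance: (ms: number) => {
      current += ms;
    },
  };
}

const clock = makeFakeClock(1000);
const session: Session = { expiresAtMs: 1500 };
const expiredAtStart = isExpired(session, clock); // clock reads 1000
clock.advance(500);
const expiredAfterAdvance = isExpired(session, clock); // clock reads 1500
```

Because each test builds its own fake clock, the result does not depend on wall time, test order, or parallel execution.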

Property-based testing

property("any valid input round-trips through parse + serialize", ...)

One property test can replace fifty unit tests. Reserve for high-value boundaries.
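To make the round-trip property concrete, here is a hand-rolled sketch with no library dependency. The Cursor type, its "offset:limit" wire format, the seeded generator, and checkRoundTrip are all assumptions for illustration; in practice a property-testing library would supply generation and shrinking, which this sketch omits.

```typescript
// Hypothetical value and wire format: "offset:limit".
type Cursor = { offset: number; limit: number };

function serialize(c: Cursor): string {
  return c.offset + ":" + c.limit;
}

function parse(s: string): Cursor {
  const parts = s.split(":");
  return { offset: Number(parts[0]), limit: Number(parts[1]) };
}

// Seeded linear congruential generator, so any failure is reproducible from the seed.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

function randomCursor(rng: () => number): Cursor {
  return {
    offset: Math.floor(rng() * 1000000),
    limit: 1 + Math.floor(rng() * 100),
  };
}

// Returns the first counterexample, or null if the property held for every generated input.
function checkRoundTrip(
  ser: (c: Cursor) => string,
  par: (s: string) => Cursor,
  runs: number,
  seed: number
): Cursor | null {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const input = randomCursor(rng);
    const output = par(ser(input));
    if (output.offset !== input.offset || output.limit !== input.limit) {
      return input;
    }
  }
  return null;
}

const counterexample = checkRoundTrip(serialize, parse, 200, 42);
```

A single call covers 200 generated inputs; swapping in a serializer that loses information makes checkRoundTrip return the first input that fails to round-trip.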

Mutation as a coverage backstop

After the unit suite stabilises, mutation testing measures its real catch rate: the share of deliberately injected faults that make at least one test fail. See mutation-testing-pattern.md.
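The idea can be shown by hand on a tiny function. In this sketch the hasNextPage logic, the three mutants, and both suites are invented for illustration; a mutation-testing tool would generate the mutants and run the real suite automatically.

```typescript
type HasNextPage = (offset: number, limit: number, total: number) => boolean;

const original: HasNextPage = (offset, limit, total) => offset + limit < total;

// Hand-written mutants, each a small deliberate fault.
const mutants: HasNextPage[] = [
  (offset, limit, total) => offset + limit <= total, // boundary mutant
  (offset, limit, total) => offset - limit < total, // operator mutant
  () => true, // constant mutant
];

type Case = { args: [number, number, number]; expected: boolean };

function passes(impl: HasNextPage, suite: Case[]): boolean {
  return suite.every((c) => impl(c.args[0], c.args[1], c.args[2]) === c.expected);
}

// Fraction of mutants that make at least one test fail ("killed").
function mutationScore(suite: Case[]): number {
  const killed = mutants.filter((m) => !passes(m, suite)).length;
  return killed / mutants.length;
}

// Both suites execute every line of hasNextPage, but only one checks the boundary.
const weakSuite: Case[] = [
  { args: [0, 10, 25], expected: true },
  { args: [20, 10, 25], expected: false },
];
const strongSuite: Case[] = [...weakSuite, { args: [15, 10, 25], expected: false }];

const weakScore = mutationScore(weakSuite); // boundary mutant survives
const strongScore = mutationScore(strongSuite); // all mutants killed
```

Line coverage is identical for the two suites; only the mutation score shows that the weak one never exercises the offset + limit === total boundary.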

Common failure modes

Tests that only assert on rendered text. Break on intl / copy changes. → Assert on structure or codes.
Tests that mock too deep. End up testing the mocks. → Mock at the trust boundary only.
Tests that share fixtures via mutation. Order-dependent. → Fresh fixtures per test, or immutable fixtures.
E2E flake "fixed" by a sleep. Flake hidden, not fixed. → Find the deterministic signal; assert on that. expect.poll() / waitFor() over fixed sleeps.
Coverage 95% but error codes never asserted. The error path is untested. → A separate gate scans tests for code: assertions; flags codes that are never asserted.
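The gate in the last item can be approximated with a plain text scan. This is a sketch under assumptions: it treats any code: "SOME_CODE" literal in a test file as an assertion on that code, and the function names and the NOT_FOUND code are hypothetical. A real gate would likely parse the test files instead of pattern-matching them.

```typescript
// Collect every code that appears as `code: "..."` in the given test sources.
function assertedCodes(testSources: string[]): string[] {
  const found: string[] = [];
  for (const src of testSources) {
    // Fresh regex per source, because a /g regex carries lastIndex state between calls.
    const pattern = /code:\s*["']([A-Z0-9_]+)["']/g;
    let match = pattern.exec(src);
    while (match !== null) {
      const code = match[1];
      if (code !== undefined && found.indexOf(code) === -1) {
        found.push(code);
      }
      match = pattern.exec(src);
    }
  }
  return found;
}

// Codes the application declares but no test ever asserts on.
function neverAsserted(declared: string[], testSources: string[]): string[] {
  const asserted = assertedCodes(testSources);
  return declared.filter((code) => asserted.indexOf(code) === -1);
}

// Example run: one test file asserts VALIDATION_ERROR; NOT_FOUND is declared but untested.
const sampleTests = ['expect(result).toMatchObject({ code: "VALIDATION_ERROR" })'];
const gaps = neverAsserted(["VALIDATION_ERROR", "NOT_FOUND"], sampleTests);
```

Failing the build when gaps is non-empty closes the hole that a line-coverage number leaves open.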
