Agents Playbook

Adversarial Bug-Hunt Pattern

How to find real logic bugs with a find → refute → reproduce loop, instead of trusting a single agent that says "looks fine".

View raw .md

Adversarial Bug-Hunt Pattern

How to find real logic bugs with a find → refute → reproduce loop, instead of trusting a single agent that says "looks fine".

TL;DR (human)

Coverage and mutation testing tell you whether your tests are good. They do not find logic bugs that no test was ever written for. To find those, run a deliberate hunt: assign each finder a single orthogonal bug-class lens, then make a second wave of agents try to refute every candidate, then require a failing reproduction before anything is called a bug. One agent asked to "find bugs" produces confident noise; an adversarial loop produces a short list of confirmed defects.

For agents

Why a single pass fails

Ask one agent "are there bugs in this module?" and you get one of two bad outcomes:

  • False confidence. "No issues found" — because it read the happy path and stopped.
  • Plausible noise. A list of ten "bugs", most of which are intended behavior, equivalent rewrites, or misreadings of the contract.

Neither is actionable. The fix is structure: orthogonal search, adversarial verification, and a reproduction bar.

Stage 1 — Find (orthogonal lenses)

Do not ask N agents to "find bugs". They will all find the same shallow thing. Assign each finder one bug-class lens so coverage is orthogonal:

LensWhat it hunts
Boundary / off-by-one< vs <=, inclusive/exclusive ranges, empty and single-element inputs
Error-handling holesswallowed errors, wrong error code, missing reject path, catch {} that hides, partial rollback
Async / race / idempotencymissing await, concurrent mutation, double-fire, retry-without-dedup, check-then-act (TOCTOU)
Security logicauthz allow/deny inversion, tenant-scope leak, egress/firewall bypass, skipped signature or expiry check, redaction gap — highest priority
State machine / invariantsillegal transition allowed, terminal-state mutation, lost update, stale cache
Input validationschema boundary not enforced on the runtime path, unsanitised passthrough, type-coercion surprise
Resource / cleanupunclosed handles, unbounded growth, leak, missing dispose / cancel

Each finder returns candidates as file:line — symptom — why it's wrong.

Target the highest-yield code first: security and tenant-scoping, then runtime correctness (state machines, sync/conflict logic), then money and contracts, then app-level auth/billing. UI rendering is low-yield for this technique — leave it to visual and E2E tests.

Stage 2 — Refute (adversarial verification)

This is the stage that kills noise. For each candidate, spawn at least two independent skeptics whose explicit job is to refute it. Prompt them to default to "not a bug" when uncertain. Majority-refute → drop the candidate.

Give the skeptics distinct angles rather than cloning one refuter — e.g. one checks the contract/docs, one checks whether a test already codifies the behavior as intended, one tries to construct an input where the code is actually correct. Diverse skeptics catch failure modes that identical ones miss.

A candidate that survives independent refutation is worth a reproduction. One that does not is discarded without ceremony — do not "soften" it into a medium-severity note to feel productive.

Stage 3 — Reproduce + confirm

A survivor is not a bug until a failing test reproduces wrong behavior against the documented or intended contract. The repro is the proof and becomes the regression test once fixed.

For each confirmed bug record: location — symptom — repro — severity (crit / high / med) — proposed fix. Fix confirmed crit/high inline (one commit per fix, with the repro test). Batch medium findings for a follow-up.

The intent-test guard

Before "fixing" a suspicious behavior, search for a test that already codifies it as intended. State-hygiene nits ("this stale field should be cleared", "this transition should throw") are frequently deliberate and pinned by an existing test. If such a test exists, the behavior is intended — you would be reverting a decision, not fixing a bug. No intent test + a failing contract repro = a real bug. See ../ai-collaboration/verify-first-pattern.md.

Loop until dry

Run the find → refute → confirm loop per module until two consecutive rounds surface nothing new. Dedup by file:line + symptom so the same finding doesn't re-enter every round. A fixed iteration count ("run it 3 times") misses the long tail; a dry-rounds counter converges honestly.

When to use it

  • Before a release of security-, money-, or correctness-critical code.
  • After mutation testing and coverage are already good — this finds a different class of defect (real logic bugs, not weak tests).
  • When a subsystem has had concurrency or state-machine churn.

Not worth it for: pure UI rendering, generated code, throwaway scripts.

Complementary techniques

  • property-fuzz-testing-pattern.md — finds bugs the lenses miss by attacking invariants and parsers directly.
  • Run the integration / E2E suites that CI skips when it is constrained — real wiring bugs hide in the layers unit tests never exercise.

Common failure modes

  • "Find bugs" with no lens. Every agent finds the same shallow thing. → Assign one orthogonal lens each.
  • No refutation stage. Plausible-but-wrong findings ship as "bugs". → Always run independent skeptics; majority-refute drops.
  • No reproduction bar. A claim becomes a "bug" without proof. → Require a failing test against the contract.
  • Fixing a pinned behavior. You revert a deliberate decision. → Check for an intent test first.
  • Fixed round count. The tail is missed. → Loop until two dry rounds.
  • Empty verdicts counted as "survived". A verification pass that didn't actually run reads as "0 refuted" and inflates the confirmed list. → Treat a missing verdict as unverified, not as survived.

See also