Agents Playbook

Human-in-the-Loop Pattern

Design where humans review, approve, correct, and take over from agents — so autonomy scales with confidence and every correction becomes training signal.

View raw .md

Human-in-the-Loop Pattern

Design where humans review, approve, correct, and take over from agents — so autonomy scales with confidence and every correction becomes training signal.

TL;DR (human)

Full autonomy and full manual are both wrong defaults. Place humans at the points where the cost of an error exceeds the cost of a review: gate irreversible or high-stakes actions behind explicit approval, escalate low-confidence outputs instead of emitting them, and let humans edit before shipping. Then capture every approval, rejection, and edit as structured signal — the edit-diff and the rejection reason are the highest-value training data you own. The goal is to earn autonomy: start supervised, measure agreement, and widen the agent's unsupervised envelope as the evals prove it. The human is not a fallback; the human is the loop that makes the agent get better.

For agents

Why HITL is an architecture, not an afterthought

For output someone puts their name on, "the model probably got it right" is not a shipping criterion. But routing everything through a human throws away the agent's leverage. The design question is precise: at which decisions does a human add more value than they cost? Answer it per action, not globally — and make the answer move as confidence grows.

The interaction modes

ModeHuman roleUse when
Approve / reject (gate)Authorizes before executionAction is irreversible, costly, externally visible, or privileged
Edit-before-shipCorrects the draftOutput is usually-right and cheap to fix; the edit is signal
Escalate-on-uncertaintyTakes over the hard casesAgent's confidence / source-support is low
Review-after (audit)Spot-checks shipped outputVolume is high, stakes moderate, full gating too slow
AutonomousNone (sampled offline)Reversible, low-stakes, evals prove reliability

These are a ladder. A new capability starts near the top (gated); as its evals and online agreement prove out, it descends toward autonomous. Demote it back up the instant quality regresses.

Gate the irreversible

The hard rule: the blast radius of the action sets the gate, not the model's confidence. Sending an email, charging a card, deleting data, publishing externally, mutating production, spending real money, anything privileged — gate it, even if the agent is "sure." Reversible, sandboxed, low-cost actions can run free. This mirrors the security posture: privileged operations require explicit, audited authorization (see ../security/governance-posture-pattern.md, ../security/audit-ledger-pattern.md).

Approvals must be legible: show the human what will happen and why (the agent's reasoning + the action's effect), not a yes/no with no context. A rubber-stamp gate is theater.

Escalate, don't emit, on low confidence

Tie escalation to the confidence signals from the eval and hallucination layers:

  • Low source-support / failed faithfulness check → escalate rather than ship a likely fabrication.
  • Low self-consistency across samples → escalate.
  • Out-of-distribution input (unlike anything in the eval set) → escalate.

A handed-off "I'm not sure about this one" is cheap and trust-building; a confident wrong answer is expensive and trust-destroying. Calibrate the threshold against cost: too eager and you drown humans (alert fatigue → rubber-stamping); too reluctant and unsafe output leaks.

Every interaction is training signal — capture it

This is the part teams leave on the floor. Each human touch is labeled data:

InteractionSignalUse
Approved as-ispositive labelreinforce; candidate for autonomy
Edited then shippedthe diffexactly what was wronghighest-value: targeted prompt/eval fix
Rejectednegative + reasonfrozen eval case; failure-mode taxonomy
Escalated & resolveda hard case + its answernew eval case; was the threshold right?

Capture them structured, not as ad-hoc edits lost in a doc. The edit-distance and the rejection reason feed straight into ../quality/agent-eval-framework-pattern.md and the next prompt-versioning-pattern.md iteration. A HITL system that doesn't record its corrections is paying for human review and discarding the receipt.

Earn autonomy with measurement

Decide the autonomy envelope from agreement data, not gut:

  1. Run the capability gated; log agent-proposed vs. human-final.
  2. Measure agreement per slice (accept rate, edit-distance).
  3. Where agreement is high and stable, widen to edit-before-ship, then sampled-audit, then autonomous.
  4. Keep the harder slices gated.
  5. Regression on the online signal → tighten the gate immediately.

Autonomy is a privilege the evals grant and revoke — not a launch decision made once.

Design for the human's attention

HITL fails when it ignores the human's cognitive budget:

  • Don't flood. Too many approvals → fatigue → rubber-stamping → the gate is now noise. Batch, prioritize by stakes, auto-clear the obvious.
  • Make review fast. Surface the diff and the reasoning; let one keystroke approve/reject/edit. Review latency is a product metric.
  • Show provenance. Sources, tool calls, confidence — so the human can judge in seconds, not re-derive the work.

Common failure modes

  • Gate on confidence, not blast radius. A confident agent deletes prod. → Irreversibility sets the gate, always.
  • Rubber-stamp approvals. Context-free yes/no → humans approve blindly. → Show what + why; batch the trivial.
  • Corrections discarded. Edits/rejections vanish into the product. → Capture structured; feed evals + prompt iteration.
  • Static autonomy. Shipped fully-auto on day one, or gated forever. → Climb/descend the ladder on measured agreement.
  • Threshold mis-set. Too eager → alert fatigue; too shy → unsafe output ships. → Calibrate against cost; watch fatigue as a metric.
  • No audit on gated actions. Approval happened, no record. → Signed ledger entry per privileged action.

See also