Tool & Capability Design Pattern
Design the abstraction layer between a model and your product — tools, file systems, artifacts, skills — so complex multi-step work feels natural to the model and stays reliable for users.
Tool & Capability Design Pattern
Design the abstraction layer between a model and your product — tools, file systems, artifacts, skills — so complex multi-step work feels natural to the model and stays reliable for users.
TL;DR (human)
Tools are the model's API to your product, and the model is a user of that API with peculiar ergonomics: it reads the description as the spec, can't see your code, and fails differently than humans. Design tools at the altitude of user intent ("schedule a meeting"), not raw endpoints; make descriptions and schemas self-explanatory; constrain inputs so whole error classes are impossible; return results the model can act on (including good errors); keep the exposed toolset small and gated to the task; and treat skills/artifacts/file-systems as first-class abstractions that turn multi-step workflows into something the model can drive reliably. The tool surface is product design for a non-human user — and it determines reliability as much as the prompt.
For agents
A tool is an API whose user is a model
The model can't read your source. The tool name, description, and schema are the entire contract — they are prompt-engineering surface, not just type plumbing. A vague description or an ambiguous parameter is a bug that shows up as "the model keeps calling it wrong." Write tool specs the way you'd write docs for an external developer who only has the function signature — because that's exactly the situation.
Design at the altitude of intent
Wrapping every REST endpoint 1:1 makes the model orchestrate plumbing — chaining six low-level calls, holding state, and failing in the middle. Instead, expose tools at the granularity of what the user wants to accomplish:
scheduleMeeting(attendees, window)— notlistCalendars+getFreeBusy+createEvent+sendInvitesfor the model to assemble.- One tool = one coherent unit of intent the model can reason about atomically.
- Push multi-step orchestration into the tool where the steps are deterministic; leave the model to decide which intent, not to hand-assemble every primitive.
Higher altitude = fewer turns, less state for the model to drop, fewer places to fail.
Make the contract self-explanatory and constrained
- Names are semantic.
searchCustomersnotquery2. The name alone should imply when to call it. - Descriptions state when to use, when not to, and what comes back. Include the failure shape. Disambiguate from sibling tools ("use this for X; for Y use
otherTool"). - Schema constrains the input space. Enums over free strings, required fields explicit, value ranges encoded. A constrained schema makes a class of wrong calls impossible rather than merely discouraged — the strongest form of guidance (see
../architecture/contracts-zod-pattern.md). - Validate at the boundary and return a usable error. Errors are part of the model's loop: a good error says what was wrong and how to fix the call, so the model self-corrects instead of looping. A raw stack trace is a dead end (see
../architecture/error-hierarchy.md).
Return results the model can use
- Shape output for consumption, not for humans. Structured, concise, relevant. A tool that dumps 10k tokens of raw JSON blows the context budget and buries the signal (see
context-management-pattern.md). - Reference large payloads by handle. Write big artifacts to a file/store and return a reference; let the model pull what it needs. Don't pour a 50-page document into the window.
- Be honest about partial success. "Created 3 of 5, these 2 failed because…" beats a flat success/error the model can't reason about.
Keep the toolset small and gated
Every tool definition spends context tokens and adds a choice the model can get wrong. Fifty tools in the window is noise, latency, and mis-selection.
- Expose only the tools relevant to the current task/state. Gate the set; reveal capabilities as the workflow reaches them.
- Prune overlap. Two tools that do almost the same thing guarantee the model sometimes picks the wrong one.
- Authorize at call time regardless. The toolset is ergonomics; permission is enforcement — the model offering a tool is never authorization to run it (see
../security/ai-llm-safety-pattern.md).
File systems, artifacts & skills as first-class abstractions
The richest agent products give the model more than function calls — they give it an environment:
- A file system / artifact store lets the model produce, revise, and reference durable work products instead of regenerating everything inline. Artifacts are how multi-step output stays coherent across turns and how large results stay out of the context window (handle, not blob).
- Skills package a reusable capability — instructions + tools + context for a task class — that the model invokes as a unit. A skill turns "here are 12 primitives, figure it out each time" into "do this known thing well," which is the difference between brittle and reliable on complex workflows.
- Design these as the abstraction layer between product and model: the product exposes capabilities (files, artifacts, skills); the model composes them. Get this layer right and complex multi-step work feels natural to the model and reliable to users — get it wrong and the model improvises plumbing and drops state.
This abstraction layer is also exactly what you publish as a machine-readable surface so other agents can discover and drive it — see self-describe-pattern.md.
Tools are versioned, tested, evaluated
A tool's description and schema are behavior — when you change them, re-run the evals (the model may now call it differently). Version tool definitions alongside prompts (prompt-versioning-pattern.md), and add tool-call assertions to the deterministic eval tier (../quality/agent-eval-framework-pattern.md): given input X, the right tool is called with the right args, and destructive tools are not called unbidden.
Common failure modes
- 1:1 endpoint wrappers. Model orchestrates plumbing, drops state mid-chain. → Design at the altitude of user intent.
- Vague descriptions. "The model keeps calling it wrong" — because the spec is the description. → Say when/when-not/returns; disambiguate siblings.
- Free-string params that should be enums. Invites invalid calls. → Constrain the schema so wrong calls are impossible.
- Raw errors returned to the model. Dead-end loops. → Actionable error: what's wrong + how to fix the call.
- Tool dumps huge payloads inline. Blows context, buries signal. → Reference by handle; shape output for consumption.
- Fifty tools always exposed. Noise, latency, mis-selection. → Gate to the task; prune overlap.
- Tool change shipped without re-eval. Model's calling behavior silently shifts. → Version + tool-call assertions in the suite.
- Treating the toolset as authorization. Offered ≠ allowed. → Enforce permission at call time.
See also
self-describe-pattern.md— publish the capability surface for external agents.prompt-versioning-pattern.md— tool defs are versioned behavior.context-management-pattern.md— toolset size + result shape are context costs.../quality/agent-eval-framework-pattern.md— tool-call assertions as deterministic evals.../architecture/contracts-zod-pattern.md— schemas that constrain tool input.../architecture/error-hierarchy.md— error shapes the model can act on.../security/ai-llm-safety-pattern.md— tool authorization + injection defense.