Engineer what goes into the context window — selection, ordering, compaction, retention — because the context is the program the model runs.

Context Management Pattern

Engineer what goes into the context window — selection, ordering, compaction, retention — because the context is the program the model runs.

TL;DR (human)

The context window is a scarce, attention-limited budget, not a bucket you fill. Quality and reliability depend more on what you put in front of the model and where than on the model itself. Engineer it deliberately: a fixed budget split across system/tools/retrieved/history/scratch; retrieve the few most relevant chunks, not everything; place the most important content at the edges (models attend least to the middle); compact long histories into running summaries before they overflow; and persist durable facts to external memory instead of re-stuffing them every turn. More tokens is not more intelligence — it is more noise and more cost.

For agents

The context is the program

A model has no state between calls except what you re-supply. Every turn, you assemble the program it executes: instructions, tools, retrieved knowledge, prior turns, and working notes. The model's output quality is bounded by the quality of that assembly. Most "the model isn't smart enough" problems are context problems — the right fact was absent, buried in the middle, or drowned by 50 irrelevant ones.

Budget the window like memory

Treat the window as a fixed allocation, not a free pool. A typical split:

Region	Holds	Discipline
System / instructions	Role, rules, output contract	Stable, versioned (see `prompt-versioning-pattern.md`); keep tight
Tool definitions	Only the tools relevant to this task	Gate the toolset; 50 tool specs is noise and a latency tax
Retrieved knowledge	Top-k most relevant chunks	Few and on-target beats many and vague
Conversation history	Recent turns + compacted older summary	Compact, don't accumulate
Working scratch / artifacts	Current files, intermediate results	Reference by handle when large

Reserve headroom for the output. Track utilization as a metric: an agent running at 95% of the window is one long input away from truncation failures.

Relevance over volume — the retrieval discipline

Stuffing "everything that might help" actively degrades quality: it dilutes attention, raises cost and latency, and increases the chance the model fixates on an irrelevant passage.

Retrieve, rank, then trim to top-k. Fewer, higher-precision chunks beat exhaustive recall.
Compress chunks to the spans that matter; drop boilerplate.
Deduplicate. Three paraphrases of the same fact waste budget and over-weight it.
Cite what you injected so the output (and your faithfulness eval) can be traced to sources, not to the model's parametric guesses (see hallucination-reduction-pattern.md).

Position matters — the lost-in-the-middle effect

Models attend most to the start and end of the context and least to the middle. Long-context capacity is not uniform recall. So:

Put the task instruction and the output contract at the end, nearest the generation point.
Put critical constraints and the most important retrieved fact near the top or bottom, never buried mid-stack.
Don't assume "it's in the window" means "it will be used." If a fact is load-bearing, position it and, for the highest stakes, restate it.

Compaction — surviving long sessions

A long-running agent will exceed any window. Compact instead of truncating blindly:

Running summary. Periodically replace the oldest turns with a model-written summary that preserves decisions, open threads, and constraints — discard the verbatim chatter.
Summarize on a budget threshold, not a turn count, so compaction fires when pressure is real.
Preserve the load-bearing, drop the transient. Decisions, identifiers, user constraints survive; greetings and resolved tangents go.
Keep a pointer to the full record. The compacted summary references where the detail lives (a log, a file) so nothing is irrecoverable — it is demoted, not deleted.
Pin anchors. The original goal and hard constraints should never be summarized away; re-inject them verbatim each compaction.

External memory — don't re-stuff durable facts

Anything that must persist across sessions does not belong in the rolling context — it belongs in durable memory the agent reads on demand:

A facts store (one fact per record, retrieved by relevance) keeps the window lean and survives session ends.
A bootstrap doc (CLAUDE.md/AGENTS.md) is the always-loaded slice; everything else is fetched when relevant.
This is the difference between an agent that re-learns the project every turn and one that remembers.

See memory-pattern.md and bootstrap-doc-pattern.md.

Caching the stable prefix

If your stack supports prompt caching, order context stable-first: system + tools + long-lived context at the front (cache hit), volatile turn-specific content at the back. Reordering the stable prefix every turn throws the cache away — a silent latency and cost regression. Cache-awareness is a context-ordering decision.

Context retention as an eval dimension

"Did the agent remember the constraint from 20 turns ago?" is measurable. Add retention cases to the eval suite: long inputs with a load-bearing fact early, then a question that needs it. Track the score per context-length slice — this is exactly where aggregate metrics hide regressions (see ../quality/agent-eval-framework-pattern.md).

Common failure modes

Stuff-everything retrieval. Dilutes attention, raises cost, invites fixation on noise. → Top-k, ranked, trimmed, deduped.
Load-bearing fact in the middle. Present but ignored (lost-in-the-middle). → Position at edges; restate the critical ones.
Blind truncation at overflow. Drops the goal or a key constraint silently. → Compact to a summary; pin anchors.
Re-stuffing durable facts every turn. Burns budget, still forgets across sessions. → External memory, retrieve on demand.
Reordering the stable prefix each turn. Kills the prompt cache; cost/latency regress invisibly. → Stable-first ordering.
Treating long-context models as infinite uniform recall. Quality degrades with fill even under the limit. → Budget + measure utilization; retention evals per length slice.
Full toolset always in context. 50 tool specs of noise + latency. → Gate tools to the task.

Context Management Pattern

Context Management Pattern

TL;DR (human)

For agents

The context is the program

Budget the window like memory

Relevance over volume — the retrieval discipline

Position matters — the lost-in-the-middle effect

Compaction — surviving long sessions

External memory — don't re-stuff durable facts

Caching the stable prefix

Context retention as an eval dimension

Common failure modes

See also

On this page