Cost Optimization Pattern (FinOps)

How to control cloud spend without micromanaging every commit.

TL;DR (human)

Left on autopilot, cloud spend without FinOps can double roughly every 18 months. The discipline: per-team budgets; per-tenant attribution; per-workload right-sizing; commitments + spot for predictable load; caching + query budgets; cost-aware CI. The goal is to "spend roughly what we said we'd spend", not to minimise at all costs.

For agents

Three FinOps phases (the FinOps Foundation framework)

| Phase | Question | Tools |
| --- | --- | --- |
| Inform | Where is the money going? | Cost dashboards; per-service / per-team / per-tenant attribution |
| Optimize | What can we cut without harm? | Right-sizing; commitments; spot; cache; query reduction |
| Operate | How do we keep it that way? | Budgets; alerts; per-PR cost gates; FinOps rituals |

Most teams skip straight to Optimize; that's wrong. Inform first; without attribution, optimization is guesswork.

Inform — cost attribution

Every dollar should answer:

  • Service: which microservice / Lambda / managed service.
  • Environment: prod / staging / dev.
  • Team: who owns it; who reviews bills.
  • Tenant (multi-tenant systems): which customer drives the cost.
  • Feature / surface (optional but powerful): which product surface.

Achieved via:

  • Cloud tags / labels: applied to every resource at creation time; enforced via IaC (Terraform, Pulumi, CDK).
  • Per-request tagging: spans / logs / metrics carry tenant + service tags.
  • Cost allocation reports: cloud-native (AWS Cost Explorer, GCP Billing, Azure Cost Management) + per-tenant rollup.

Untagged resources = mystery costs. Hard rule: no untagged resources in production.
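
The "no untagged resources" rule can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the `Resource` type and the `REQUIRED_TAGS` keys are hypothetical names chosen to mirror the attribution dimensions above, and a real check would read resources from the IaC plan or the cloud provider's inventory.

```python
from dataclasses import dataclass, field

# Hypothetical required tag keys, mirroring the attribution dimensions above.
REQUIRED_TAGS = ("service", "environment", "team")


@dataclass
class Resource:
    resource_id: str
    tags: dict[str, str] = field(default_factory=dict)


def missing_tags(resource: Resource, required=REQUIRED_TAGS) -> list[str]:
    """Return required tag keys that are absent or blank on the resource."""
    return [key for key in required if not resource.tags.get(key, "").strip()]


def untagged_report(resources: list[Resource]) -> dict[str, list[str]]:
    """Map resource id -> missing tag keys, for resources that break the rule."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource)
        if gaps:
            report[resource.resource_id] = gaps
    return report
```

A CI step could fail the build whenever `untagged_report` returns a non-empty mapping for production resources.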

Per-tenant attribution

In multi-tenant SaaS, per-tenant cost drives:

  • Pricing: usage-based or tier-based pricing depends on cost knowledge.
  • Customer success: tenants that cost 10× more to serve than they pay are at-risk accounts (their cost to serve, on top of acquisition cost, outweighs their revenue).
  • Capacity planning: which tenants will grow, and by how much.
  • Quota tuning: where to put limits.

Per-tenant cost is computed from observability tags (per observability-pattern.md) and rolled up nightly into a per-tenant cost table.
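
One way to sketch the nightly rollup is proportional allocation: each service's daily cost is split across tenants according to their share of usage. This is an assumption about the method, not something the pattern prescribes; the function name and the `(service, tenant, units)` row shape are hypothetical, with `units` standing in for whatever the telemetry measures (requests, CPU-seconds, bytes scanned).

```python
from collections import defaultdict


def allocate_by_usage(service_cost: dict[str, float],
                      usage: list[tuple[str, str, float]]) -> dict[str, float]:
    """Split each service's daily cost across tenants in proportion to usage.

    service_cost: service -> cost for the day (from the billing export).
    usage: (service, tenant, units) rows derived from tagged telemetry.
    """
    # Total usage per service, used as the denominator for each tenant's share.
    totals = defaultdict(float)
    for service, _tenant, units in usage:
        totals[service] += units

    per_tenant = defaultdict(float)
    for service, tenant, units in usage:
        if totals[service] > 0:
            share = units / totals[service]
            per_tenant[tenant] += service_cost.get(service, 0.0) * share
    return dict(per_tenant)
```

Services that have cost but no tagged usage stay unallocated in this sketch; tracking that remainder separately shows how complete the attribution is.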

Right-sizing

Most workloads over-provision. Symptoms:

  • CPU steady < 30%.
  • Memory steady < 50%.
  • Network rarely saturated.

Right-sizing process:

  1. Measure: 30+ days of utilisation per instance / service.
  2. Recommend: smaller instance class; lower memory; fewer replicas.
  3. Stage: change in staging; measure.
  4. Promote: change in production with rollback path.

Keep a reasonable cushion: target 50–70% steady-state utilisation, and lower for spiky workloads.
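
The "Recommend" step can be sketched as a small calculation: take a high percentile of observed utilisation and size capacity so that load would run at the target utilisation. The function names, the nearest-rank percentile, and the defaults (95th percentile, 60% target) are illustrative assumptions rather than part of the pattern.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_capacity(current_units: float, utilisation: list[float],
                         target: float = 0.6, pct: float = 95.0) -> float:
    """Capacity at which the pct-th percentile load runs at `target` utilisation.

    utilisation: samples as fractions of current capacity (0.25 means 25%).
    """
    busy_units = percentile(utilisation, pct) * current_units  # capacity actually in use
    return busy_units / target
```

For example, a service on 10 units that sits at 30% utilisation would be recommended about 5 units at a 60% target. The result is a starting point for the staging step, not a change to apply blindly.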

Auto-scaling helps but only when:

  • Cold-start is acceptable (sub-minute scale-up).
  • Stateless workload.
  • Predictable load shape.

Commitments + spot

Cloud providers reward predictability:

  • Reserved instances / commitments (1y, 3y): 30–70% off list price.
  • Spot instances: 60–90% off; reclaimed on short notice.

Mix:

  • Steady baseline: covered by commitments (~70% of capacity).
  • Burst above baseline: spot or on-demand.
  • Stateful / critical: on-demand or reserved; never spot.

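Commitments are a forecasting bet. Under-commit and you miss part of the discount; over-commit and you pay for unused capacity. Default to under-committing: a missed discount can be recovered by committing more later, while an over-commitment is locked in for the term.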
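
The trade-off can be made concrete with a simple cost model. This is a sketch under stated assumptions: committed units are billed every hour at a discounted rate whether used or not, demand above the commitment is billed on-demand, and spot is ignored. The function names and the flat `discount` parameter are hypothetical.

```python
def blended_cost(hourly_demand: list[float], committed_units: float,
                 on_demand_rate: float, discount: float) -> float:
    """Total cost when `committed_units` are paid for every hour at a discounted
    rate (used or not) and demand above that level is billed on-demand."""
    committed_rate = on_demand_rate * (1.0 - discount)
    cost = 0.0
    for demand in hourly_demand:
        cost += committed_units * committed_rate  # paid regardless of use
        cost += max(0.0, demand - committed_units) * on_demand_rate  # overflow
    return cost


def best_commitment(hourly_demand: list[float], candidates: list[float],
                    on_demand_rate: float, discount: float) -> float:
    """Pick the candidate commitment level with the lowest total cost."""
    return min(candidates,
               key=lambda c: blended_cost(hourly_demand, c, on_demand_rate, discount))
```

Running `best_commitment` over historical demand shows how quickly cost rises once the commitment exceeds the steady baseline, which is the quantitative argument for committing conservatively.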

Query + storage budgets

In data-heavy systems, database cost often dominates compute.

Per-endpoint discipline (extends performance-budgets-pattern.md):

  • Query count budget per request (N+1 detection: > 20 = probable bug).
  • Bytes scanned budget per request (avoid full-table scans).
  • Result size budget (paginate everything; max 100 rows per page default).
  • Cold storage tier for data > 90 days unused.
  • Compression: enable everywhere it pays (most columnar stores).

Per-tenant query budgets prevent noisy-neighbor cost spikes.
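
A per-tenant query budget might look like the following fixed-window limiter on estimated bytes scanned. It is an in-memory sketch only: the class name, the window length, and the choice of bytes scanned as the budgeted unit are assumptions, and a multi-instance service would need shared state.

```python
import time


class TenantQueryBudget:
    """Fixed-window budget of bytes scanned per tenant (hypothetical limits)."""

    def __init__(self, max_bytes_per_window: int, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_bytes = max_bytes_per_window
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._window_start: dict[str, float] = {}
        self._used: dict[str, int] = {}

    def try_spend(self, tenant: str, estimated_bytes: int) -> bool:
        """Reserve budget for a query; return False if it would exceed the cap."""
        now = self.clock()
        start = self._window_start.get(tenant)
        if start is None or now - start >= self.window:
            self._window_start[tenant] = now  # start a fresh window
            self._used[tenant] = 0
        if self._used[tenant] + estimated_bytes > self.max_bytes:
            return False
        self._used[tenant] += estimated_bytes
        return True
```

A request handler would call `try_spend` before running a query and reject or defer the query when it returns `False`, so one tenant's heavy scans cannot consume the budget of others.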

Egress + data transfer

Often the surprise on cloud bills:

  • Cross-region transfer: usually expensive.
  • Egress to internet: expensive at scale.
  • Cross-AZ transfer: sometimes free, sometimes not.

Mitigations:

  • Keep data hot in the same region / AZ as the consumer.
  • CDN for public assets (one-time push; cheap edge serving).
  • Avoid cross-region replication for non-critical data.
  • Compress on the wire.

Background jobs + queues

Asynchronous processing is often cheaper than synchronous request handling:

  • Async background jobs scale on cheaper compute (spot OK).
  • Queues buffer bursts; smooth provisioning.
  • Job retries are cheap when the work is idempotent.

Discipline:

  • Idempotency-key on every job (replays don't double-charge).
  • Dead-letter queue for permanently failing jobs.
  • Visibility into queue depth + worker utilisation.
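
The idempotency-key and dead-letter rules above can be sketched with an in-memory runner. Everything here is illustrative: `JobRunner` is a hypothetical name, the stores are plain dictionaries, and a real system would persist completed keys and rely on the queue's own dead-letter support.

```python
class JobRunner:
    """In-memory sketch: dedupe by idempotency key, dead-letter after max attempts."""

    def __init__(self, handler, max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.completed: dict[str, object] = {}  # idempotency key -> result
        self.attempts: dict[str, int] = {}
        self.dead_letter: list[tuple[str, dict]] = []

    def run(self, idempotency_key: str, payload: dict):
        if idempotency_key in self.completed:
            # Replay of a finished job: return the stored result, no second side effect.
            return self.completed[idempotency_key]
        if self.attempts.get(idempotency_key, 0) >= self.max_attempts:
            raise RuntimeError("job is dead-lettered; not retrying")
        attempt = self.attempts.get(idempotency_key, 0) + 1
        self.attempts[idempotency_key] = attempt
        try:
            result = self.handler(payload)
        except Exception:
            if attempt >= self.max_attempts:
                self.dead_letter.append((idempotency_key, payload))
            raise  # let the caller or queue decide whether to retry
        self.completed[idempotency_key] = result
        return result
```

Because completed keys short-circuit, redelivering the same message is harmless, which is what makes cheap retries on spot capacity safe.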

Cache hit rate

A cache pays for itself when:

  • Hit rate > 60% in steady state.
  • Origin compute / DB cost per request > cache cost per request.

Measure cache hit rate per cache; track over time. Hit rate drops are signals (key churn, app pattern change, eviction pressure).
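
The second condition can be refined into a break-even hit rate. The sketch below assumes a simple model that the pattern does not spell out: every request pays the cache cost, and only hits avoid the origin cost. Function and parameter names are hypothetical, and both costs must be in the same unit.

```python
def cache_net_saving(hit_rate: float, origin_cost: float, cache_cost: float) -> float:
    """Saving per request: origin cost avoided on hits minus cache cost paid on every request."""
    return hit_rate * origin_cost - cache_cost


def break_even_hit_rate(origin_cost: float, cache_cost: float) -> float:
    """Hit rate above which the cache saves money under this simple model."""
    if origin_cost <= 0:
        raise ValueError("origin_cost must be positive")
    return cache_cost / origin_cost
```

Under this model a cache in front of an expensive origin pays off at a low hit rate, while a cache in front of a cheap origin may never pay off; the 60% figure above works as a general rule of thumb between those extremes.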

See ../architecture/distributed-data-pattern.md for cache tiers.

Dev / CI cost

Often overlooked:

  • CI minutes: cache aggressively (per ci-cd-pipeline-pattern.md).
  • Per-PR ephemeral environments: convenient but expensive; apply a lifecycle policy (automatic tear-down after N days).
  • Dev databases: long-running instances; sleep / terminate on no-activity.
  • Build artefact storage: tier old artefacts to cold; expire after N versions.

The cost gap between "every team has its own everything 24/7" and "shared dev infrastructure with lifecycle policies" can be as large as 10×.
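
A lifecycle policy for ephemeral environments can be as small as a scheduled job that lists environments idle past a cutoff. This is a sketch: the function name, the input mapping of environment name to last-activity time, and the 7-day default for "N days" are assumptions.

```python
from datetime import datetime, timedelta, timezone


def expired_environments(last_active: dict[str, datetime], now: datetime,
                         max_age_days: int = 7) -> list[str]:
    """Return names of environments idle for longer than `max_age_days`."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, seen in last_active.items() if seen < cutoff)
```

The caller passes timezone-aware timestamps, for example `datetime.now(timezone.utc)`, and hands the returned names to whatever tears the environments down.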

Cost gates

Beyond budgets, per-PR cost signals:

  • Bundle size increase → bandwidth + CDN cost.
  • New cloud resources in IaC diff → manual review.
  • New paid service dependency → PR comment with monthly estimate.
  • Performance regression → potentially higher per-request cost.

These extend the gate suite (see quality-gates-pattern.md).
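
The "new cloud resources in IaC diff" signal could be derived from a Terraform plan. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes` entries from that output; the function names are hypothetical, and other IaC tools would need their own parser.

```python
import json


def new_resources(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would create.

    Replacements also include a "create" action, so they are flagged too.
    """
    plan = json.loads(plan_json)
    created = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "create" in actions:
            created.append(change["address"])
    return created


def needs_cost_review(plan_json: str) -> bool:
    """True when the plan adds at least one resource and should get a manual look."""
    return bool(new_resources(plan_json))
```

A PR check could post the list from `new_resources` as a comment and request review from the owning team when `needs_cost_review` is true.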

FinOps rituals

| Cadence | Activity |
| --- | --- |
| Daily | Cost-anomaly alerts trigger on spike |
| Weekly | Top spenders dashboard reviewed |
| Monthly | Per-service / per-team budgets reviewed |
| Quarterly | Commitments + right-sizing review |
| Annually | Cloud-provider contract negotiation; multi-cloud strategy review |

Anomaly detection

A 2× cost spike in 24h means something changed. Possible causes:

  • New deploy with a query regression.
  • A customer's traffic spike (good or bad).
  • A bug producing infinite retries.
  • A test inadvertently shipped that hits an expensive path.
  • Account compromise (cryptominer; spam).

The anomaly alert routes to the team that owns the service. Severity depends on the magnitude of the spike relative to baseline:

  • 1.5× = warn.
  • 3× = SEV-3.
  • 10× = SEV-1 (probable runaway or compromise).
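
The thresholds above map directly onto a small classifier. In this sketch the baseline is assumed to be something like a trailing daily average, which the pattern does not specify; the function name and the "ok" label for sub-threshold days are hypothetical.

```python
def classify_spike(today_cost: float, baseline_cost: float) -> str:
    """Map the ratio of today's cost to baseline onto the severities listed above."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    ratio = today_cost / baseline_cost
    if ratio >= 10:
        return "SEV-1"  # probable runaway or compromise
    if ratio >= 3:
        return "SEV-3"
    if ratio >= 1.5:
        return "warn"
    return "ok"
```

Running this per service, not only on the account total, keeps a spike in one small service from being hidden by a large, stable bill.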

Cost-per-X metrics

Useful tracking metrics:

  • Cost per active user (DAU / MAU).
  • Cost per request.
  • Cost per tenant (per pricing-tier).
  • Cost per transaction (for products with discrete units of value).

Surface these in dashboards alongside business metrics, so that engineering leadership can reason about cost in product terms.

Multi-cloud caveat

Multi-cloud is sometimes proposed for cost. Reality:

  • Cross-cloud egress is expensive; data gravity locks you in.
  • Operations cost (per-cloud expertise, tooling) often exceeds savings.
  • Single-cloud + multi-region usually delivers most of the resilience.

Adopt multi-cloud for sovereignty / vendor risk / specific service needs — not for cost.

Common failure modes

  • No tagging. Cost mystery; cannot attribute. → Enforce in IaC.
  • Over-provisioning on autopilot. Auto-scaling configured min ≥ peak; never scales down. → Right-size min.
  • No budgets. Bills surprise leadership. → Per-team budget; alerts at 50/75/90%.
  • Cost work treated as separate from engineering. → Engineers see their cost; act on it.
  • Optimisations that break SLOs. Cost down; quality down. → Cost is optimised within SLO constraints; it never overrides them.
  • Spot for stateful. Reclaimed mid-write. → On-demand for state; spot for stateless.
  • Accidental cross-region traffic. Service in region A calls DB in region B. → Network policy + alert.
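
The 50/75/90% budget alerts from the list above could be evaluated with a check like this one. It is a minimal sketch: the function name is hypothetical, and in practice the cloud provider's native budget alerts would usually do this job.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds: tuple[float, ...] = (0.5, 0.75, 0.9)) -> list[float]:
    """Return the alert thresholds (as fractions of budget) already reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in thresholds if used >= t]
```

Comparing the result with the thresholds already alerted on this period avoids sending the same notification twice.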

Tooling stack (typical)

| Concern | Tool |
| --- | --- |
| Cloud-native cost | AWS Cost Explorer + Budgets, GCP Billing, Azure Cost Management |
| Per-resource analysis | Vantage, Infracost, CloudHealth, Spot.io |
| Per-Kubernetes pod cost | Kubecost, OpenCost |
| Anomaly detection | CloudZero, native cloud anomaly alerts |
| Per-tenant attribution | In-house roll-up from observability tags |
| Right-sizing recommendations | AWS Compute Optimizer, GCP Recommender |
| FinOps governance | FinOps Foundation framework |

Adoption path

  1. Day 0: tag everything; one budget per environment.
  2. Month 1: per-service attribution; top-spenders dashboard.
  3. Quarter 1: right-sizing review; first commitments / reservations.
  4. Quarter 2: per-tenant attribution; cost-per-X dashboards.
  5. Quarter 3+: PR-level cost signals; cost-aware feature design.
  6. Mature: FinOps team / role; engineering OKRs include cost.

See also