Cost Optimization Pattern (FinOps)

How to control cloud spend without micromanaging every commit.

TL;DR (human)

Without FinOps, cloud spend tends to grow on autopilot (as a rule of thumb, doubling roughly every 18 months). The discipline: per-team budgets; per-tenant attribution; per-workload right-sizing; commitments for predictable load, spot for interruptible burst; caching and query budgets; cost-aware CI. The goal is "spend roughly what we said we'd spend", not to minimise at any cost.

For agents

Three FinOps phases (the FinOps Foundation framework)

Phase	Question	Tools
Inform	Where is the money going?	Cost dashboards; per-service / per-team / per-tenant attribution
Optimize	What can we cut without harm?	Right-sizing; commitments; spot; cache; query reduction
Operate	How do we keep it that way?	Budgets; alerts; per-PR cost gates; FinOps rituals

Most teams skip straight to Optimize; that's wrong. Inform first; without attribution, optimization is guesswork.

Inform — cost attribution

Every dollar should answer:

Service: which microservice / Lambda / managed service.
Environment: prod / staging / dev.
Team: who owns it; who reviews bills.
Tenant (multi-tenant systems): which customer drives the cost.
Feature / surface (optional but powerful): which product surface.

Achieved via:

Cloud tags / labels: applied to every resource at creation time; enforced via IaC (Terraform, Pulumi, CDK).
Per-request tagging: spans / logs / metrics carry tenant + service tags.
Cost allocation reports: cloud-native (AWS Cost Explorer, GCP Billing, Azure Cost Management) + per-tenant rollup.

Untagged resources = mystery costs. Hard rule: no untagged resources in production.
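A minimal sketch of how the "no untagged resources" rule could be checked, assuming resources have already been exported as plain dictionaries. The tag keys in `REQUIRED_TAGS`, the `id` / `tags` field names, and the sample inventory are hypothetical; a real check would read from the IaC plan or the cloud provider's inventory.

```python
from typing import Iterable, Mapping

# Hypothetical tag keys mirroring the attribution dimensions listed above.
REQUIRED_TAGS = ("service", "environment", "team")


def missing_tags(resource: Mapping) -> list[str]:
    """Return the required tag keys that are absent or empty on a resource."""
    tags = resource.get("tags") or {}
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def find_untagged(resources: Iterable[Mapping]) -> dict[str, list[str]]:
    """Map resource id -> missing tag keys, for resources that break the rule."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource)
        if gaps:
            report[resource["id"]] = gaps
    return report


# Illustrative inventory (made-up resources).
inventory = [
    {"id": "db-1", "tags": {"service": "billing", "environment": "prod", "team": "payments"}},
    {"id": "vm-7", "tags": {"service": "search"}},
    {"id": "bucket-3", "tags": None},
]
print(find_untagged(inventory))
```

Run as a CI step or a scheduled audit, a non-empty report would fail the build or open a ticket for the owning team.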

Per-tenant attribution

In multi-tenant SaaS, per-tenant cost drives:

Pricing: usage-based or tier-based pricing depends on cost knowledge.
Customer success: tenants that cost 10× more to serve than they pay are unprofitable and need attention (serving cost, on top of acquisition cost, outweighs their revenue).
Capacity planning: which tenants will grow, and by how much.
Quota tuning: where to put limits.

Computed from observability tags (per observability-pattern.md). Roll up nightly into a per-tenant cost table.
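One way the nightly roll-up could work, sketched under the assumption that each service's daily cost is split across tenants in proportion to a usage measure (requests, CPU-seconds, bytes) taken from the observability tags. Proportional allocation, the function name, and the row shape are illustrative choices, not something the pattern prescribes.

```python
from collections import defaultdict


def rollup_tenant_cost(
    service_cost: dict[str, float],
    usage: list[tuple[str, str, float]],
) -> dict[str, float]:
    """Split each service's daily cost across tenants in proportion to usage.

    service_cost: service -> cost for the day (from the cloud bill).
    usage: (service, tenant, units) rows derived from observability tags.
    """
    total_units: dict[str, float] = defaultdict(float)
    for service, _tenant, units in usage:
        total_units[service] += units

    tenant_cost: dict[str, float] = defaultdict(float)
    for service, tenant, units in usage:
        if total_units[service] > 0:
            share = units / total_units[service]
            tenant_cost[tenant] += service_cost.get(service, 0.0) * share
    return dict(tenant_cost)


# Illustrative numbers only.
costs = {"api": 100.0, "db": 300.0}
rows = [
    ("api", "acme", 750), ("api", "globex", 250),
    ("db", "acme", 100), ("db", "globex", 300),
]
print(rollup_tenant_cost(costs, rows))
```

Services with no recorded usage contribute nothing here; in practice that unallocated remainder is worth tracking as its own line so it does not disappear.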

Right-sizing

Most workloads over-provision. Symptoms:

CPU steady < 30%.
Memory steady < 50%.
Network rarely saturated.

Right-sizing process:

Measure: 30+ days of utilisation per instance / service.
Recommend: smaller instance class; lower memory; fewer replicas.
Stage: change in staging; measure.
Promote: change in production with rollback path.

Hold a reasonable cushion (50–70% utilisation steady-state; lower for spiky workloads).
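The symptoms and cushion above can be turned into a rough screening check. This sketch assumes one peak-utilisation sample per day, expressed as a fraction, and treats a workload as a candidate only when both CPU and memory stay under their thresholds for the whole window. The function names and the 60% default target (the middle of the 50–70% band) are assumptions.

```python
CPU_THRESHOLD = 0.30     # "CPU steady < 30%"
MEMORY_THRESHOLD = 0.50  # "Memory steady < 50%"
MIN_DAYS = 30            # "30+ days of utilisation"


def is_overprovisioned(cpu_daily_peak: list[float], mem_daily_peak: list[float]) -> bool:
    """True when there is enough history and every day stayed under both thresholds."""
    if min(len(cpu_daily_peak), len(mem_daily_peak)) < MIN_DAYS:
        return False  # not enough data to recommend anything
    return max(cpu_daily_peak) < CPU_THRESHOLD and max(mem_daily_peak) < MEMORY_THRESHOLD


def suggested_capacity(current_capacity: float, peak_utilisation: float,
                       target_utilisation: float = 0.60) -> float:
    """Capacity that would put the observed peak at the target utilisation."""
    return current_capacity * peak_utilisation / target_utilisation


cpu = [0.22] * 29 + [0.24]
mem = [0.40] * 30
if is_overprovisioned(cpu, mem):
    # 16 vCPU at a 24% peak would suggest roughly 6.4 vCPU at a 60% target.
    print(suggested_capacity(16, max(cpu)))
```

The result is a recommendation only; the staging and rollback steps in the process still apply before any change reaches production.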

Auto-scaling helps but only when:

Cold-start is acceptable (sub-minute scale-up).
Stateless workload.
Predictable load shape.

Commitments + spot

Cloud providers reward predictability:

Reserved instances / commitments (1y, 3y): 30–70% off list price.
Spot instances: 60–90% off; reclaimed on short notice.

Mix:

Steady baseline: covered by commitments (~70% of capacity).
Burst above baseline: spot or on-demand.
Stateful / critical: on-demand or reserved; never spot.

Commitments are a forecasting bet. Under-commit and miss the discount; over-commit and pay for unused capacity. Default to under-committing.
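The asymmetry behind "default to under-committing" can be seen with a small cost model. This is a sketch assuming a flat on-demand rate, a single discount, and that committed capacity is billed every hour whether used or not; the demand curve and the 40% discount are made-up figures.

```python
def blended_cost(hourly_demand: list[float], committed: float,
                 on_demand_rate: float, discount: float) -> float:
    """Total cost when `committed` units are reserved and overflow runs on demand."""
    committed_rate = on_demand_rate * (1 - discount)
    total = 0.0
    for demand in hourly_demand:
        total += committed * committed_rate                   # paid even if idle
        total += max(0.0, demand - committed) * on_demand_rate  # burst above baseline
    return total


# One illustrative day: 10 units for 12 hours, 20 units for 12 hours.
demand = [10.0] * 12 + [20.0] * 12
for committed in (0, 10, 20, 30):
    print(committed, blended_cost(demand, committed, on_demand_rate=1.0, discount=0.4))
```

In this example, committing to the baseline of 10 units is cheapest, committing to the peak of 20 is somewhat worse, and committing to 30 costs more than running everything on demand.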

Query + storage budgets

In data-heavy systems, database cost often dominates compute.

Per-endpoint discipline (extends performance-budgets-pattern.md):

Query count budget per request (N+1 detection: > 20 = probable bug).
Bytes scanned budget per request (avoid full-table scans).
Result size budget (paginate everything; max 100 rows per page default).
Cold storage tier for data > 90 days unused.
Compression: enable everywhere it pays (most columnar stores).

Per-tenant query budgets prevent noisy-neighbor cost spikes:

Max query CPU-time per tenant per minute.
Max bytes scanned per tenant per hour.
Circuit-break at limit; surface as QUOTA_EXCEEDED (see ../security/multi-tenant-isolation-pattern.md).
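A per-tenant scan budget with a circuit-break could look roughly like the following. It is an in-memory sketch assuming a sliding one-hour window and a single process; the class names are hypothetical, and a production version would keep the counters in a shared store.

```python
import time
from collections import defaultdict, deque


class QuotaExceeded(Exception):
    """Raised when a tenant would exceed its budget; surfaced as QUOTA_EXCEEDED."""
    code = "QUOTA_EXCEEDED"


class TenantScanBudget:
    WINDOW_SECONDS = 3600

    def __init__(self, max_bytes_per_hour: int, clock=time.monotonic):
        self.max_bytes = max_bytes_per_hour
        self.clock = clock
        self.events = defaultdict(deque)  # tenant -> deque of (timestamp, bytes)
        self.totals = defaultdict(int)    # tenant -> bytes inside the window

    def charge(self, tenant: str, scanned_bytes: int) -> None:
        """Record a scan, or raise QuotaExceeded if it would break the budget."""
        now = self.clock()
        window = self.events[tenant]
        # Drop entries that have aged out of the sliding window.
        while window and now - window[0][0] >= self.WINDOW_SECONDS:
            _, old_bytes = window.popleft()
            self.totals[tenant] -= old_bytes
        if self.totals[tenant] + scanned_bytes > self.max_bytes:
            raise QuotaExceeded(f"{tenant} exceeded bytes-scanned budget")
        window.append((now, scanned_bytes))
        self.totals[tenant] += scanned_bytes
```

A caller would wrap query execution with `charge()` using the planner's estimate or the measured bytes, and map `QuotaExceeded.code` onto the API error returned to the tenant.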

Egress + data transfer

Often the surprise on cloud bills:

Cross-region transfer: usually expensive.
Egress to internet: expensive at scale.
Cross-AZ transfer: pricing varies by provider and service; check before assuming it is free.

Mitigations:

Keep data hot in the same region / AZ as the consumer.
CDN for public assets (one-time push; cheap edge serving).
Avoid cross-region replication for non-critical data.
Compress on the wire.

Background jobs + queues

Asynchronous work is often cheaper than synchronous work:

Async background jobs scale on cheaper compute (spot OK).
Queues buffer bursts; smooth provisioning.
Job retries are cheap when the work is idempotent.

Discipline:

Idempotency-key on every job (replays don't double-charge).
Dead-letter queue for permanently failing jobs.
Visibility into queue depth + worker utilisation.
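The idempotency-key and dead-letter rules can be illustrated with a toy job runner. This sketch keeps completed keys and the dead-letter list in memory, which is an assumption for brevity; the `JobRunner` class and its parameters are hypothetical, and a real worker would persist both.

```python
class JobRunner:
    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.completed = {}    # idempotency key -> stored result
        self.dead_letter = []  # (key, payload, error) for permanently failing jobs

    def run(self, key: str, handler, payload):
        """Run handler(payload) at most once per key; retry, then dead-letter."""
        if key in self.completed:
            # Replay of a finished job: return the stored result, do no new work.
            return self.completed[key]

        last_error = None
        for _attempt in range(self.max_attempts):
            try:
                result = handler(payload)
            except Exception as exc:  # retrying assumes the handler is idempotent
                last_error = exc
                continue
            self.completed[key] = result
            return result

        self.dead_letter.append((key, payload, repr(last_error)))
        return None
```

Queue depth and worker utilisation would be reported separately; the length of `dead_letter` is itself a useful signal to alert on.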

Cache hit rate

A cache pays for itself when:

Hit rate > 60% in steady state.
Origin compute / DB cost per request > cache cost per request.

Measure cache hit rate per cache; track over time. Hit rate drops are signals (key churn, app pattern change, eviction pressure).
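A simple cost model makes the two conditions concrete. It assumes every request pays for a cache lookup and every miss additionally pays the origin cost; under that assumption the cache breaks even when the hit rate equals the ratio of cache cost to origin cost. The per-request figures are invented for illustration.

```python
def cost_per_request(hit_rate: float, cache_cost: float, origin_cost: float) -> float:
    """Expected cost of one request served through the cache."""
    return cache_cost + (1.0 - hit_rate) * origin_cost


def break_even_hit_rate(cache_cost: float, origin_cost: float) -> float:
    """Hit rate above which the cache is cheaper than always going to origin."""
    return cache_cost / origin_cost


cache_cost, origin_cost = 0.00002, 0.0001  # illustrative dollars per request
print(break_even_hit_rate(cache_cost, origin_cost))
print(cost_per_request(0.60, cache_cost, origin_cost) < origin_cost)
```

With these figures the break-even hit rate is about 20%, so a 60% steady-state hit rate leaves a comfortable margin; when cache and origin costs are closer together, the break-even point rises accordingly.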

See ../architecture/distributed-data-pattern.md for cache tiers.

Dev / CI cost

Often overlooked:

CI minutes: cache aggressively (per ci-cd-pipeline-pattern.md).
Per-PR ephemeral environments: convenient but expensive; lifecycle them (auto-tear-down after N days).
Dev databases: long-running instances; sleep / terminate on no-activity.
Build artefact storage: tier old artefacts to cold; expire after N versions.

A 10× cost gap exists between "every team has its own everything 24/7" and "shared dev infrastructure with lifecycle policies".
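A lifecycle policy for ephemeral environments can be as small as a scheduled sweep. The sketch below assumes each environment record carries a name and a timezone-aware `last_activity` timestamp; both field names and the sample records are hypothetical, and the TTL is left as a parameter.

```python
from datetime import datetime, timedelta, timezone


def expired_environments(envs: list[dict], now: datetime, ttl_days: int) -> list[str]:
    """Names of environments with no activity for longer than the TTL."""
    cutoff = now - timedelta(days=ttl_days)
    return [env["name"] for env in envs if env["last_activity"] < cutoff]


now = datetime(2024, 6, 15, tzinfo=timezone.utc)
envs = [
    {"name": "pr-101", "last_activity": datetime(2024, 6, 14, tzinfo=timezone.utc)},
    {"name": "pr-087", "last_activity": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
print(expired_environments(envs, now, ttl_days=7))
```

The sweep only lists candidates; the actual tear-down would go through the same IaC tooling that created the environment.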

Cost gates

Beyond budgets, per-PR cost signals:

Bundle size increase → bandwidth + CDN cost.
New cloud resources in IaC diff → manual review.
New paid service dependency → PR comment with monthly estimate.
Performance regression → potentially higher per-request cost.

These extend the gate suite (see quality-gates-pattern.md).
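The "new cloud resources in IaC diff" signal could be derived from a Terraform plan. This sketch assumes the plan has been exported with `terraform show -json` and parsed into a dictionary, and it reads the `resource_changes` entries whose `change.actions` is exactly `["create"]`. The function name and the sample plan are illustrative; other IaC tools would need their own parsing.

```python
def created_resources(plan: dict) -> list[str]:
    """Addresses of resources the plan would newly create."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if change.get("change", {}).get("actions") == ["create"]
    ]


# Trimmed, made-up plan in the shape of Terraform's JSON plan output.
plan = {
    "resource_changes": [
        {"address": "aws_db_instance.reports", "change": {"actions": ["create"]}},
        {"address": "aws_s3_bucket.assets", "change": {"actions": ["update"]}},
        {"address": "aws_instance.worker", "change": {"actions": ["no-op"]}},
    ]
}
new = created_resources(plan)
if new:
    print("Manual cost review required for:", ", ".join(new))
```

A PR bot could post this list as a comment and require an approval from the owning team before merge.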

FinOps rituals

Cadence	Activity
Daily	Cost-anomaly alerts trigger on spike
Weekly	Top spenders dashboard reviewed
Monthly	Per-service / per-team budgets reviewed
Quarterly	Commitments + right-sizing review
Annually	Cloud-provider contract negotiation; multi-cloud strategy review

Anomaly detection

A 2× cost spike in 24h means something changed. Possible causes:

New deploy with a query regression.
A customer's traffic spike (good or bad).
A bug producing infinite retries.
A test inadvertently shipped to production that hits an expensive path.
Account compromise (cryptominer; spam).

Anomaly alert routes to the team that owns the service. SEV depends on magnitude:

≥ 1.5× baseline = warn.
≥ 3× baseline = SEV-3.
≥ 10× baseline = SEV-1 (probable runaway or compromise).
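The severity mapping can be expressed as a small classifier. This sketch treats each multiplier as a lower bound relative to a baseline (for example, a trailing daily average), which is an interpretation of the list above; the function name and the choice of baseline are assumptions.

```python
from typing import Optional


def classify_spike(baseline_cost: float, current_cost: float) -> Optional[str]:
    """Map the cost ratio against baseline to an alert level, or None if normal."""
    if baseline_cost <= 0:
        return None  # no meaningful baseline yet; handle new services separately
    ratio = current_cost / baseline_cost
    if ratio >= 10:
        return "SEV-1"
    if ratio >= 3:
        return "SEV-3"
    if ratio >= 1.5:
        return "warn"
    return None


for today in (120, 200, 300, 1000):
    print(today, classify_spike(100, today))
```

Routing then uses the service and team tags on the spend to page the owning team at the returned level.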

Cost-per-X metrics

Useful tracking metrics:

Cost per active user (DAU / MAU).
Cost per request.
Cost per tenant (per pricing-tier).
Cost per transaction (for products with discrete units of value).

Surface these in dashboards alongside business metrics. Engineering leadership reasons about cost in product terms.

Multi-cloud caveat

Multi-cloud is sometimes proposed for cost. Reality:

Cross-cloud egress is expensive; data gravity locks you in.
Operations cost (per-cloud expertise, tooling) often exceeds savings.
Single-cloud + multi-region usually delivers most of the resilience.

Adopt multi-cloud for sovereignty / vendor risk / specific service needs — not for cost.

Common failure modes

No tagging. Cost mystery; cannot attribute. → Enforce in IaC.
Over-provisioning on autopilot. Auto-scaling configured min ≥ peak; never scales down. → Right-size min.
No budgets. Bills surprise leadership. → Per-team budget; alerts at 50/75/90%.
Cost work treated as separate from engineering. → Engineers see their cost; act on it.
Optimisations that break SLOs. Cost down; quality down. → Cost optimisation is constrained by SLOs, not ranked above them.
Spot for stateful. Reclaimed mid-write. → On-demand for state; spot for stateless.
Accidental cross-region traffic. Service in region A calls DB in region B. → Network policy + alert.

Tooling stack (typical)

Concern	Tool
Cloud-native cost	AWS Cost Explorer + Budgets, GCP Billing, Azure Cost Management
Per-resource analysis	Vantage, Infracost, CloudHealth, Spot.io
Per-Kubernetes pod cost	Kubecost, OpenCost
Anomaly detection	CloudZero, native cloud anomaly alerts
Per-tenant attribution	In-house roll-up from observability tags
Right-sizing recommendations	AWS Compute Optimizer, GCP Recommender
FinOps governance	FinOps Foundation framework

Adoption path

Day 0: tag everything; one budget per environment.
Month 1: per-service attribution; top-spenders dashboard.
Quarter 1: right-sizing review; first commitments / reservations.
Quarter 2: per-tenant attribution; cost-per-X dashboards.
Quarter 3+: PR-level cost signals; cost-aware feature design.
Mature: FinOps team / role; engineering OKRs include cost.
