Platform Engineering + Internal Developer Platform Pattern
How to build the layer between cloud / infra and product engineers, so product teams ship fast without re-learning infra each time.
Platform Engineering + Internal Developer Platform Pattern
How to build the layer between cloud / infra and product engineers, so product teams ship fast without re-learning infra each time.
TL;DR (human)
Platform engineering productises infrastructure for internal developers. The deliverable is an Internal Developer Platform (IDP) — a curated set of golden paths (templates, paved roads, self-service tools) that let teams ship without becoming infra experts. Tracked metrics: lead time, deploy frequency, change failure rate, time to recover (DORA). The platform team's customer is the product team.
For agents
What an IDP includes
| Surface | What it provides |
|---|---|
| Service templates | new-service scaffolds: code, CI, deploy, observability wired |
| Self-service deploy | "I want to ship this" → one command / PR |
| Self-service env | "I need staging for my PR" → ephemeral env automatically |
| Service catalog | "Where does X live?" → searchable inventory (Backstage et al) |
| Observability defaults | dashboards, alerts, SLOs scaffolded per service |
| Secrets self-service | "I need a new vault entry" → request + audit |
| Documentation hub | API specs, ADRs, runbooks indexed |
| Cost visibility | per-team / per-service spend dashboards |
The platform is a product. Product engineers are customers; survey them.
Golden paths
A golden path is the easy, paved way to do a common thing — vs the unpaved way, which is allowed but offers no support.
Examples:
- New microservice: scaffold → live in 30 minutes.
- Add a new background job: existing job-runner abstraction.
- Add a new RPC method: existing dispatcher + schema package.
- Add a new database: managed RDS via IaC.
Going off-road is allowed. Going off-road for everything = the platform isn't paving real needs.
Self-service ≠ unsupervised
Self-service means a product engineer can do it without filing a ticket. It doesn't mean unreviewed:
- Pre-flight checks (cost, security review, capacity).
- Audit trail of who provisioned what.
- Default-secure choices baked in.
- Auto-rollback on health-check failure.
Self-service + guardrails = velocity + safety. Self-service without guardrails = chaos. Guardrails without self-service = bottleneck.
DORA metrics
Per-team or per-service measurement (DevOps Research and Assessment):
| Metric | Definition | Elite team target |
|---|---|---|
| Deploy frequency | How often code reaches prod | Multiple per day |
| Lead time for changes | Commit → production | < 1 day |
| Change failure rate | % of deploys causing incident | 0-15% |
| Time to restore | Incident → resolved | < 1 hour |
These metrics drive platform investment. The platform team's goal is to move every product team toward elite.
Service catalog
Per service:
- Name + description.
- Owning team.
- On-call info.
- Source repo.
- Documentation links (ADRs, RFCs, runbooks).
- Dependencies (which services this depends on; which depend on this).
- Tier (critical, important, supporting).
- Compliance tags.
Tool: Backstage, Cortex, OpsLevel, in-house.
The catalog answers "who do I ask?" + "what depends on this?" + "is this safe to change?".
Templates + scaffolds
A create-\<thing\> CLI / template:
$ npx create-service my-new-service --type=node-ts-api
# scaffolds:
# - source skeleton
# - package.json with workspace conventions
# - Dockerfile + CI workflow
# - terraform module
# - observability defaults
# - README + ADR templateTemplates encode conventions (per universal.md, ts-concrete.md) so new services start compliant.
Templates evolve; old services migrate via a separate "update-service" tool.
Ephemeral environments
Per-PR preview environments:
- Triggered automatically by PR open.
- Live URL posted to PR.
- Resources auto-torn-down on PR close (or N days idle).
- Cost-attributed to the PR author / team.
Lets reviewers + designers + product see real changes without staging churn.
Capacity + cost guardrails
Self-service is dangerous without guardrails:
- Per-team budget caps.
- Auto-tear-down of idle resources.
- Required tags (per
../quality/cost-optimization-pattern.md). - New service request → reviewed if cost > threshold.
Documentation as a platform feature
A docs portal aggregates:
- Per-service READMEs (Backstage-rendered).
- API references (OpenAPI / GraphQL schemas / RPC method indexes).
- ADRs / RFCs.
- Runbooks.
- Tutorials / quickstarts.
- Search.
Engineers find docs in seconds, not minutes. Search quality matters more than doc volume.
Platform team's customer
Product engineers. Their satisfaction is the platform team's KPI:
- Onboarding time: new engineer → first PR shipped.
- Time to spin up a new service.
- "How happy are you with the platform?" — quarterly survey.
- Support tickets: their volume + topics.
Anti-pattern: platform team optimises for its own elegance, ignores adopters' pain.
Inner-source contributions
Product teams contribute back to the platform:
- A team builds a niche helper; useful broadly; promote into platform.
- A team finds a bug in a template; patches it; gets credit.
- Codeowners model: platform team owns merging; community drives PRs.
Healthy platforms grow from the edges, not from the center alone.
Anti-pattern: the gatekeeper platform
Symptoms:
- Every product change requires a platform ticket.
- "Wait 2 weeks for the platform team to enable this."
- Workarounds proliferate.
- Product teams build shadow infrastructure.
Cure: more golden paths; more self-service; fewer tickets.
Common failure modes
- No platform team; everyone reinvents. Drift; bus factor low. → Form a platform team when the org passes ~30 engineers.
- Platform team builds without users. Adoption zero. → Adopt-first design.
- Self-service without guardrails. Cost / security incidents. → Guardrails baked in.
- Old services don't migrate. Platform forks; legacy slows. → Migration tooling; deprecation.
- Templates rot. New service uses old template; conventions wrong. → Templates own conventions; CI verifies.
- No service catalog. Org-wide knowledge in heads. → Backstage or equivalent.
Tooling stack (typical)
| Concern | Tool |
|---|---|
| Service catalog | Backstage, Cortex, OpsLevel, Compass |
| IaC | Terraform, Pulumi, CDK, Crossplane |
| Templates | cookiecutter, plop, Backstage Software Templates |
| Ephemeral envs | Vercel preview deploys, Render, Fly.io, custom on k8s |
| Self-service portals | Backstage, in-house |
| DORA metrics | Faros, LinearB, in-house from CI + deploy logs |
Adoption path
- < 30 engineers: no platform team needed; shared playbook.
- 30-100: first platform engineer; service catalog; templates.
- 100-300: platform team of 3-8; self-service deploy; DORA tracking.
- 300+: full IDP; multiple platform sub-teams (compute, data, security, observability).
Don't form a platform team too early; you have nothing to platform yet.
See also
anti-overengineering.md— premature platform = canonical overengineering.../quality/ci-cd-pipeline-pattern.md— platform owns the pipeline.../quality/cost-optimization-pattern.md— platform owns cost attribution.../governance/universal.md— platform changes go through ADR / RFC.