Distributed Data Pattern
How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.
Distributed Data Pattern
How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.
TL;DR (human)
Distributed data starts with read replicas (cheap, mostly transparent). Then sharding (expensive, design-defining). Then multi-region (operationally hard, recovery-defining). Each step trades consistency for availability and complexity. Pick the cheapest one that solves the actual problem; do not adopt the next tier speculatively.
For agents
The CAP triangle, briefly
A distributed system under partition can guarantee at most two of: Consistency, Availability, Partition tolerance. Real systems are not on the corners — they pick a position on the edges.
- CP (consistency over availability under partition): banking, audit ledgers. Reads/writes refuse if quorum unreachable.
- AP (availability over consistency): social feed, analytics. Reads/writes succeed; data is eventually consistent.
- CA (no partition tolerance — only viable in a single node).
You will be AP for most user-facing data and CP for money + audit + identity.
Step 1 — Read replicas
Cheap, mostly transparent. One primary handles writes; N replicas serve reads.
Rules:
- Writes go to primary. Always.
- Reads with strict freshness go to primary. Authentication, "did my write land", post-transaction reads.
- Reads that tolerate staleness go to replicas. Listings, dashboards, analytics.
- The application layer chooses:
db.replica.users.list(...)vsdb.primary.users.list(...). Not the ORM's auto-magic. Auto-routing produces surprise replication-lag bugs.
Replication lag:
- Typically tens to hundreds of milliseconds in steady state.
- Spikes to seconds under load.
- Tail can reach minutes during failover.
Design your queries to tolerate the worst-case lag, or send the affected query to primary.
Step 2 — Sharding
When data per primary exceeds what one node handles — typically when total data approaches a TB or QPS exceeds 10k+ — shard.
Shard key choice is permanent (or at least very expensive to change). Get it right.
Good shard keys:
tenant_id/workspace_idfor multi-tenant systems (most queries are per-tenant).user_idfor user-facing systems.- Time-bucketed for append-heavy systems (event logs).
Bad shard keys:
- Auto-increment id (sequential = hot last shard).
created_atonly (hot active shard).- Anything that produces "one big shard" (one popular tenant).
Cross-shard queries are expensive. Design queries so 95%+ stay within a shard.
Resharding is its own discipline:
- Pre-split into more shards than you currently need (over-shard).
- Use logical shards mapped to physical nodes; moving a logical shard is a node-add operation.
- Tools: Vitess, Citus, application-level sharding with consistent hashing.
Step 3 — Multi-region
When users are distributed globally OR the failure of one region must not take the system down — multi-region.
Three patterns:
- Active-passive: one region writes; others stand by. Failover is operator-driven; RPO = replication lag, RTO = minutes.
- Active-active with leader per partition: each partition (tenant, customer, geographic block) has a leader region. Writes to your data only succeed in your region. Cross-partition operations are rare and expensive.
- Fully active-active with CRDT / multi-leader: writes succeed anywhere; conflicts resolved at the data layer. Expensive but powerful.
Most products start at 1. Mature SaaS at 2. Few need 3.
RPO / RTO
Per system, document:
- RPO (Recovery Point Objective): how much data can we lose in a disaster? "5 minutes" means replication is configured for ≤ 5-min lag.
- RTO (Recovery Time Objective): how long to be back up? "30 minutes" means the failover procedure must complete within that.
RPO and RTO are promises. The infrastructure must be able to deliver them; the runbook must be tested.
Replication lag — visible in product
When you have replicas, replication lag becomes a product concern:
- Write-then-read in the same request: route the read to primary or use a session-pinned router.
- Write-then-read across requests: use a "version cookie" — the write returns a version stamp; the next read carries it; the read either waits or routes to primary.
- List-after-create: the new record may not appear in the listing for a few hundred ms. Either send the listing to primary or surface the new record optimistically in the UI.
This is eventual consistency in disguise. Document it; expect it.
Eventual consistency UX
When eventual consistency is exposed to users:
- Communicate optimistically: show the new state immediately in the UI, even if the read hasn't caught up.
- Reconcile on next reload: if the optimistic state was wrong, show the truth, with an explanation.
- Avoid surfaces where strict consistency is expected (financial balances, audit logs).
Distributed transactions
Avoid. They are slow, fragile, and cap throughput.
When you genuinely need atomicity across two stores:
- Saga: a sequence of local transactions + compensations. Each step can fail; compensate the prior steps. Common for orchestrated workflows.
- Outbox: write changes to an outbox table in the same transaction as the business write; a separate process publishes the outbox events.
- Two-phase commit: only when latency / availability constraints allow. Rare in modern systems.
The discipline: prefer single-store atomic operations + sagas + outboxes over distributed transactions.
Distributed ID generation
Auto-increment IDs do not work across shards.
| Approach | Pros | Cons |
|---|---|---|
| UUID v4 | Trivial; collision-free | Random insert order kills B-tree performance |
| UUID v7 / ULID | Sortable + collision-free | Standard support varies |
| Snowflake | Sortable + compact | Coordination layer; clock-sensitive |
| KSUID | Sortable; URL-safe | Larger than auto-increment |
| Pre-allocated ranges per shard | Sortable + fast | Coordination at allocation time |
Default: ULID. Sortable, collision-free, compact, library support broad.
Caching tiers
When the database is the bottleneck, caching tiers absorb load:
- In-process cache: per-process; ~ms latency; short TTL. For read-heavy, low-mutation data.
- Distributed cache (Redis, Memcached): cross-process; sub-ms in-region; medium TTL. For shared-read data.
- CDN: edge cache; ~ms latency globally; long TTL. For public content + static assets.
Cache invalidation is the second hard problem. Three strategies:
- Time-based (TTL): simple; stale-but-bounded.
- Event-based: on write, evict / update relevant cache entries. Complex; risk of bugs.
- Versioned keys: each entity has a version; key includes version; updates produce new keys.
Versioned keys are surprisingly powerful; consider them before event-based invalidation.
Common failure modes
- Adopting sharding before exhausting a vertical scale. A single big primary handles enormous load; sharding adds complexity for no benefit. → Measure first; vertical-scale first.
- Shard key chosen by intuition, not data. Hot shard; resharding misery. → Analyze actual query patterns.
- Cross-shard queries everywhere. Sharded but with the cost of unsharded. → Audit queries; >95% should be single-shard.
- Replication lag ignored in code. Write-then-read inconsistency surfaces as random bugs. → Explicit primary/replica routing.
- Multi-region without RPO/RTO documented. Failover happens; nobody knows what was lost. → Document; drill.
- Cache invalidation via event-broadcast, no fallback. Event missed → stale forever. → TTL as a backstop on every cache.
See also
anti-overengineering.md— distributed data is the canonical over-engineering trap.multi-region-pattern.md— operational concerns at region scope.../security/multi-tenant-isolation-pattern.md— tenancy + sharding interplay.../quality/observability-pattern.md— measure replication lag, cache hit rate, query distribution.
Contracts — Zod Method Registry Pattern
TS-concrete recipe for a typed JSON-RPC / HTTP / IPC boundary. Scales to several hundred methods across dozens of namespaces in a real production codebase.
Error Hierarchy
How to design an error model that survives multi-agent development and client-side pattern matching.