How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.

Distributed Data Pattern

How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.

TL;DR (human)

Distributed data starts with read replicas (cheap, mostly transparent). Then sharding (expensive, design-defining). Then multi-region (operationally hard, recovery-defining). Each step trades consistency for availability and complexity. Pick the cheapest one that solves the actual problem; do not adopt the next tier speculatively.

For agents

The CAP triangle, briefly

A distributed system under partition can guarantee at most two of: Consistency, Availability, Partition tolerance. Real systems are not on the corners — they pick a position on the edges.

CP (consistency over availability under partition): banking, audit ledgers. Reads/writes refuse if quorum unreachable.
AP (availability over consistency): social feed, analytics. Reads/writes succeed; data is eventually consistent.
CA (no partition tolerance — only viable in a single node).

You will be AP for most user-facing data and CP for money + audit + identity.

Step 1 — Read replicas

Cheap, mostly transparent. One primary handles writes; N replicas serve reads.

Rules:

Writes go to primary. Always.
Reads with strict freshness go to primary. Authentication, "did my write land", post-transaction reads.
Reads that tolerate staleness go to replicas. Listings, dashboards, analytics.
The application layer chooses: db.replica.users.list(...) vs db.primary.users.list(...). Not the ORM's auto-magic. Auto-routing produces surprise replication-lag bugs.

Replication lag:

Typically tens to hundreds of milliseconds in steady state.
Spikes to seconds under load.
Tail can reach minutes during failover.

Design your queries to tolerate the worst-case lag, or send the affected query to primary.

Step 2 — Sharding

When data per primary exceeds what one node handles — typically when total data approaches a TB or QPS exceeds 10k+ — shard.

Shard key choice is permanent (or at least very expensive to change). Get it right.

Good shard keys:

tenant_id / workspace_id for multi-tenant systems (most queries are per-tenant).
user_id for user-facing systems.
Time-bucketed for append-heavy systems (event logs).

Bad shard keys:

Auto-increment id (sequential = hot last shard).
created_at only (hot active shard).
Anything that produces "one big shard" (one popular tenant).

Cross-shard queries are expensive. Design queries so 95%+ stay within a shard.

Resharding is its own discipline:

Pre-split into more shards than you currently need (over-shard).
Use logical shards mapped to physical nodes; moving a logical shard is a node-add operation.
Tools: Vitess, Citus, application-level sharding with consistent hashing.

Step 3 — Multi-region

When users are distributed globally OR the failure of one region must not take the system down — multi-region.

Three patterns:

Active-passive: one region writes; others stand by. Failover is operator-driven; RPO = replication lag, RTO = minutes.
Active-active with leader per partition: each partition (tenant, customer, geographic block) has a leader region. Writes to your data only succeed in your region. Cross-partition operations are rare and expensive.
Fully active-active with CRDT / multi-leader: writes succeed anywhere; conflicts resolved at the data layer. Expensive but powerful.

Most products start at 1. Mature SaaS at 2. Few need 3.

RPO / RTO

Per system, document:

RPO (Recovery Point Objective): how much data can we lose in a disaster? "5 minutes" means replication is configured for ≤ 5-min lag.
RTO (Recovery Time Objective): how long to be back up? "30 minutes" means the failover procedure must complete within that.

RPO and RTO are promises. The infrastructure must be able to deliver them; the runbook must be tested.

Replication lag — visible in product

When you have replicas, replication lag becomes a product concern:

Write-then-read in the same request: route the read to primary or use a session-pinned router.
Write-then-read across requests: use a "version cookie" — the write returns a version stamp; the next read carries it; the read either waits or routes to primary.
List-after-create: the new record may not appear in the listing for a few hundred ms. Either send the listing to primary or surface the new record optimistically in the UI.

This is eventual consistency in disguise. Document it; expect it.

Eventual consistency UX

When eventual consistency is exposed to users:

Communicate optimistically: show the new state immediately in the UI, even if the read hasn't caught up.
Reconcile on next reload: if the optimistic state was wrong, show the truth, with an explanation.
Avoid surfaces where strict consistency is expected (financial balances, audit logs).

Distributed transactions

Avoid. They are slow, fragile, and cap throughput.

When you genuinely need atomicity across two stores:

Saga: a sequence of local transactions + compensations. Each step can fail; compensate the prior steps. Common for orchestrated workflows.
Outbox: write changes to an outbox table in the same transaction as the business write; a separate process publishes the outbox events.
Two-phase commit: only when latency / availability constraints allow. Rare in modern systems.

The discipline: prefer single-store atomic operations + sagas + outboxes over distributed transactions.

Distributed ID generation

Auto-increment IDs do not work across shards.

Approach	Pros	Cons
UUID v4	Trivial; collision-free	Random insert order kills B-tree performance
UUID v7 / ULID	Sortable + collision-free	Standard support varies
Snowflake	Sortable + compact	Coordination layer; clock-sensitive
KSUID	Sortable; URL-safe	Larger than auto-increment
Pre-allocated ranges per shard	Sortable + fast	Coordination at allocation time

Default: ULID. Sortable, collision-free, compact, library support broad.

Caching tiers

When the database is the bottleneck, caching tiers absorb load:

In-process cache: per-process; ~ms latency; short TTL. For read-heavy, low-mutation data.
Distributed cache (Redis, Memcached): cross-process; sub-ms in-region; medium TTL. For shared-read data.
CDN: edge cache; ~ms latency globally; long TTL. For public content + static assets.

Cache invalidation is the second hard problem. Three strategies:

Time-based (TTL): simple; stale-but-bounded.
Event-based: on write, evict / update relevant cache entries. Complex; risk of bugs.
Versioned keys: each entity has a version; key includes version; updates produce new keys.

Versioned keys are surprisingly powerful; consider them before event-based invalidation.

Common failure modes

Adopting sharding before exhausting a vertical scale. A single big primary handles enormous load; sharding adds complexity for no benefit. → Measure first; vertical-scale first.
Shard key chosen by intuition, not data. Hot shard; resharding misery. → Analyze actual query patterns.
Cross-shard queries everywhere. Sharded but with the cost of unsharded. → Audit queries; >95% should be single-shard.
Replication lag ignored in code. Write-then-read inconsistency surfaces as random bugs. → Explicit primary/replica routing.
Multi-region without RPO/RTO documented. Failover happens; nobody knows what was lost. → Document; drill.
Cache invalidation via event-broadcast, no fallback. Event missed → stale forever. → TTL as a backstop on every cache.

Distributed Data Pattern

Distributed Data Pattern

TL;DR (human)

For agents

The CAP triangle, briefly

Step 1 — Read replicas

Step 2 — Sharding

Step 3 — Multi-region

RPO / RTO

Replication lag — visible in product

Eventual consistency UX

Distributed transactions

Distributed ID generation

Caching tiers

Common failure modes

See also

On this page