Agents Playbook
Pillars/Architecture

Distributed Data Pattern

How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.

Distributed Data Pattern

How to design data layout when one database stops being enough — read replicas, sharding, replication lag, CAP trade-offs, eventual consistency.

TL;DR (human)

Distributed data starts with read replicas (cheap, mostly transparent). Then sharding (expensive, design-defining). Then multi-region (operationally hard, recovery-defining). Each step trades consistency for availability and complexity. Pick the cheapest one that solves the actual problem; do not adopt the next tier speculatively.

For agents

The CAP triangle, briefly

A distributed system under partition can guarantee at most two of: Consistency, Availability, Partition tolerance. Real systems are not on the corners — they pick a position on the edges.

  • CP (consistency over availability under partition): banking, audit ledgers. Reads/writes refuse if quorum unreachable.
  • AP (availability over consistency): social feed, analytics. Reads/writes succeed; data is eventually consistent.
  • CA (no partition tolerance — only viable in a single node).

You will be AP for most user-facing data and CP for money + audit + identity.

Step 1 — Read replicas

Cheap, mostly transparent. One primary handles writes; N replicas serve reads.

Rules:

  • Writes go to primary. Always.
  • Reads with strict freshness go to primary. Authentication, "did my write land", post-transaction reads.
  • Reads that tolerate staleness go to replicas. Listings, dashboards, analytics.
  • The application layer chooses: db.replica.users.list(...) vs db.primary.users.list(...). Not the ORM's auto-magic. Auto-routing produces surprise replication-lag bugs.

Replication lag:

  • Typically tens to hundreds of milliseconds in steady state.
  • Spikes to seconds under load.
  • Tail can reach minutes during failover.

Design your queries to tolerate the worst-case lag, or send the affected query to primary.

Step 2 — Sharding

When data per primary exceeds what one node handles — typically when total data approaches a TB or QPS exceeds 10k+ — shard.

Shard key choice is permanent (or at least very expensive to change). Get it right.

Good shard keys:

  • tenant_id / workspace_id for multi-tenant systems (most queries are per-tenant).
  • user_id for user-facing systems.
  • Time-bucketed for append-heavy systems (event logs).

Bad shard keys:

  • Auto-increment id (sequential = hot last shard).
  • created_at only (hot active shard).
  • Anything that produces "one big shard" (one popular tenant).

Cross-shard queries are expensive. Design queries so 95%+ stay within a shard.

Resharding is its own discipline:

  • Pre-split into more shards than you currently need (over-shard).
  • Use logical shards mapped to physical nodes; moving a logical shard is a node-add operation.
  • Tools: Vitess, Citus, application-level sharding with consistent hashing.

Step 3 — Multi-region

When users are distributed globally OR the failure of one region must not take the system down — multi-region.

Three patterns:

  1. Active-passive: one region writes; others stand by. Failover is operator-driven; RPO = replication lag, RTO = minutes.
  2. Active-active with leader per partition: each partition (tenant, customer, geographic block) has a leader region. Writes to your data only succeed in your region. Cross-partition operations are rare and expensive.
  3. Fully active-active with CRDT / multi-leader: writes succeed anywhere; conflicts resolved at the data layer. Expensive but powerful.

Most products start at 1. Mature SaaS at 2. Few need 3.

RPO / RTO

Per system, document:

  • RPO (Recovery Point Objective): how much data can we lose in a disaster? "5 minutes" means replication is configured for ≤ 5-min lag.
  • RTO (Recovery Time Objective): how long to be back up? "30 minutes" means the failover procedure must complete within that.

RPO and RTO are promises. The infrastructure must be able to deliver them; the runbook must be tested.

Replication lag — visible in product

When you have replicas, replication lag becomes a product concern:

  • Write-then-read in the same request: route the read to primary or use a session-pinned router.
  • Write-then-read across requests: use a "version cookie" — the write returns a version stamp; the next read carries it; the read either waits or routes to primary.
  • List-after-create: the new record may not appear in the listing for a few hundred ms. Either send the listing to primary or surface the new record optimistically in the UI.

This is eventual consistency in disguise. Document it; expect it.

Eventual consistency UX

When eventual consistency is exposed to users:

  • Communicate optimistically: show the new state immediately in the UI, even if the read hasn't caught up.
  • Reconcile on next reload: if the optimistic state was wrong, show the truth, with an explanation.
  • Avoid surfaces where strict consistency is expected (financial balances, audit logs).

Distributed transactions

Avoid. They are slow, fragile, and cap throughput.

When you genuinely need atomicity across two stores:

  • Saga: a sequence of local transactions + compensations. Each step can fail; compensate the prior steps. Common for orchestrated workflows.
  • Outbox: write changes to an outbox table in the same transaction as the business write; a separate process publishes the outbox events.
  • Two-phase commit: only when latency / availability constraints allow. Rare in modern systems.

The discipline: prefer single-store atomic operations + sagas + outboxes over distributed transactions.

Distributed ID generation

Auto-increment IDs do not work across shards.

ApproachProsCons
UUID v4Trivial; collision-freeRandom insert order kills B-tree performance
UUID v7 / ULIDSortable + collision-freeStandard support varies
SnowflakeSortable + compactCoordination layer; clock-sensitive
KSUIDSortable; URL-safeLarger than auto-increment
Pre-allocated ranges per shardSortable + fastCoordination at allocation time

Default: ULID. Sortable, collision-free, compact, library support broad.

Caching tiers

When the database is the bottleneck, caching tiers absorb load:

  1. In-process cache: per-process; ~ms latency; short TTL. For read-heavy, low-mutation data.
  2. Distributed cache (Redis, Memcached): cross-process; sub-ms in-region; medium TTL. For shared-read data.
  3. CDN: edge cache; ~ms latency globally; long TTL. For public content + static assets.

Cache invalidation is the second hard problem. Three strategies:

  • Time-based (TTL): simple; stale-but-bounded.
  • Event-based: on write, evict / update relevant cache entries. Complex; risk of bugs.
  • Versioned keys: each entity has a version; key includes version; updates produce new keys.

Versioned keys are surprisingly powerful; consider them before event-based invalidation.

Common failure modes

  • Adopting sharding before exhausting a vertical scale. A single big primary handles enormous load; sharding adds complexity for no benefit. → Measure first; vertical-scale first.
  • Shard key chosen by intuition, not data. Hot shard; resharding misery. → Analyze actual query patterns.
  • Cross-shard queries everywhere. Sharded but with the cost of unsharded. → Audit queries; >95% should be single-shard.
  • Replication lag ignored in code. Write-then-read inconsistency surfaces as random bugs. → Explicit primary/replica routing.
  • Multi-region without RPO/RTO documented. Failover happens; nobody knows what was lost. → Document; drill.
  • Cache invalidation via event-broadcast, no fallback. Event missed → stale forever. → TTL as a backstop on every cache.

See also