Agents Playbook
Pillars/Architecture

Offline-First + Sync Pattern

How to design apps that work without network connectivity and reconcile state when connectivity returns.

Offline-First + Sync Pattern

How to design apps that work without network connectivity and reconcile state when connectivity returns.

TL;DR (human)

Offline-first means the local copy is the truth; sync to remote when possible. Designed right: instant interactions, work-anywhere, automatic reconciliation. Designed wrong: lost data, conflicting state, user confusion. Three problems to solve: local persistence, sync protocol, conflict resolution. CRDTs simplify the last; explicit policy works otherwise.

For agents

Three concerns

ConcernQuestionTools
Local persistenceWhere does data live offline?IndexedDB (web), SQLite (mobile), file system (desktop)
Sync protocolHow do client + server reconcile?Custom REST + diff, GraphQL subscriptions, CouchDB-style replication
Conflict resolutionWhen local and remote disagree, who wins?CRDT auto-merge, last-writer-wins, manual resolution UX

Local persistence

Web: IndexedDB (browser-native; large quotas; structured) > localStorage (small; sync). Wrappers: Dexie, idb, RxDB.

Mobile: SQLite (cross-platform; mature; queryable) > AsyncStorage (key-value only). Wrappers: WatermelonDB, Realm.

Desktop: SQLite or file-based; per-platform OS-native options.

The local store mirrors the server's data shape. Reads are local; writes are local-first (then sync).

Sync protocol

Sync protocol is the contract between client + server. Three styles:

Full replication: client downloads everything for its scope. Re-sync == replace. Works for small datasets per user.

Incremental sync:

  • Client tracks last-sync timestamp / version.
  • Server returns changes since.
  • Client applies; updates last-sync.

Requires server-side change log (per event-streaming-pattern.md) or efficient timestamp queries.

Operational transform / CRDT replication: each side captures operations or CRDT updates; merge convergent regardless of order.

Operation log on client

Local writes captured as operations (not just state mutations):

type LocalOp = {
  id: string;
  type: "create" | "update" | "delete";
  entity: string;
  payload: unknown;
  clientTimestamp: number;
  status: "pending" | "synced" | "failed";
};

Operations sit in a local queue. When online: replay in order to server; mark synced. Server returns canonical-state diffs.

Conflict resolution strategies

StrategyWhen
Last-writer-wins (LWW)Strict total order via timestamps; data class allows loss
First-writer-wins"Once set, immutable" semantics
CRDTMath guarantees convergence regardless of order
ManualUI shows conflict; user picks
Custom semanticDomain-specific merge

CRDT is the cleanest if your data shape fits (counters, text editors, lists, sets). Outside those shapes: LWW + careful UX.

Clock discipline

Conflict resolution often uses timestamps. Clock skew breaks it:

  • Use server time when sync happens (server stamps).
  • Hybrid Logical Clocks (HLC) for distributed correctness without strict NTP.
  • Lamport / vector clocks for strict ordering.

Don't use client wall-clock alone — devices drift, users adjust manually.

Sync state machine per record

local-only → syncing → synced
local-only → syncing → failed → retry
synced → modified-locally → syncing → ...

Each record's status determines UI behaviour:

  • local-only: badge "Saving locally".
  • syncing: subtle activity indicator.
  • synced: clean.
  • failed: error state with retry.

Offline UX

Tell the user:

  • Network status: visible (header banner when offline).
  • Sync progress: subtle (post-success), explicit (when active sync is non-trivial).
  • Per-record state: badges or icons for in-progress / failed.
  • Conflicts (rare): explicit UX to resolve.

Never silently lose data. If a sync fails permanently, surface it.

Authentication offline

Tricky:

  • Tokens issued before going offline still work locally.
  • Refresh fails offline; access token may expire mid-offline session.
  • Permission changes on server don't reach client until online.

Pattern: cached permissions; long-lived offline access token; re-auth when online if expired during offline window. Step-up operations (per ../security/session-mgmt-pattern.md) require online.

Service workers (web)

For PWAs:

  • Service worker intercepts requests.
  • Cache strategies: cache-first (static assets), network-first (API), stale-while-revalidate.
  • Background sync API: queue writes; replay when online.

Service workers add complexity; only adopt when offline is a core product requirement.

Sync at scale

For each tenant in multi-tenant systems:

  • Sync state per (user, device).
  • Sync windows: tenants don't see other tenants' streams.
  • Large dataset users may need progressive sync (paginated by time / entity).

Local data enables local search:

  • IndexedDB indexes for fast lookup.
  • SQLite FTS5 for full-text search.
  • Bloom filters for membership tests in larger datasets.

Search latency dominates UX; local search makes the app feel instant.

When NOT offline-first

  • Single-session web tools.
  • Sensitive data that shouldn't sit on client devices.
  • Truly real-time-only (live video).
  • Trivial CRUD where online assumption is acceptable.

Offline-first is a significant architecture commitment. Adopt when network unreliability is part of the product reality.

Common failure modes

  • Local data + no sync UI: user wonders if their work persisted. → Per-record status.
  • LWW without HLC: clock skew = wrong winner. → HLC or server stamps.
  • No conflict UX for high-stakes data: silent loss. → Manual resolution; never silently overwrite.
  • No retry policy: failed sync stays failed. → Exponential backoff; resume on connectivity.
  • Token expiry during offline: user logged out unexpectedly. → Long offline window; clear re-auth on return.
  • All data synced for every user: storage explodes; sync slow. → Scope (tenant; recent; visible).
  • No "i lost this" recovery path: data gone if device wiped. → Server is canonical; local is cache; loss recoverable from server if synced.

Tooling stack (typical)

ConcernTool
Local DB (web)IndexedDB direct, Dexie, idb, RxDB
Local DB (mobile)SQLite, WatermelonDB, Realm
Sync frameworkPouchDB / CouchDB, Replicache, Electric SQL, PowerSync
CRDTYjs (Y.js), Automerge, Loro
Operation queueCustom; or sync framework's built-in
Service workerWorkbox

Adoption path

  1. Day 0: no offline support. Online-required. Document.
  2. If offline needed: pick scope (read-only offline first; then writes; then full offline).
  3. Choose strategy: LWW for simple; CRDT for collaborative; manual UX for high-stakes.
  4. Per-record sync state in UI.
  5. Test: throttled / disconnected / re-connect cycles + multi-device.
  6. Drill: conflict scenarios; data loss recovery.

See also