Event Streaming Pattern
How to design async, decoupled communication via queues, pub/sub, and event streams — without losing events, double-processing, or stalling consumers.
Event Streaming Pattern
How to design async, decoupled communication via queues, pub/sub, and event streams — without losing events, double-processing, or stalling consumers.
TL;DR (human)
Three primitives: queues (work-distribution; one consumer per message), pub/sub topics (broadcast; many independent consumers), event streams (durable log; replayable; ordered per partition). Choose by use case. Discipline: idempotency on every consumer, dead-letter queue, schema evolution, replay tooling, backpressure handling. The hardest mistakes are subtle — re-delivery semantics, ordering guarantees, exactly-once myths.
For agents
Three primitives
| Primitive | Semantics | Use case | Examples |
|---|---|---|---|
| Queue | At-least-once; one consumer per msg; FIFO or fair | Work distribution (jobs) | SQS, RabbitMQ, BullMQ |
| Pub/sub topic | At-least-once; fanout to N subscribers; usually no ordering | Notify many independent consumers | SNS, Cloud Pub/Sub, Redis pub/sub |
| Event stream | Durable log; ordered per partition; replayable | Event-sourced systems; analytics pipelines | Kafka, Kinesis, NATS JetStream, Redpanda |
Mixing primitives is normal (queue + topic + stream). Picking the wrong one is costly.
Delivery semantics — the truth
Three theoretical options:
- At-most-once: messages may be lost; never duplicated.
- At-least-once: messages always delivered; may be duplicated.
- Exactly-once: each message processed exactly once.
In practice:
- Most production systems are at-least-once.
- "Exactly-once" usually means at-least-once + idempotent consumer.
- True end-to-end exactly-once exists in some systems (Kafka transactions + transactional sinks) but is expensive and narrow.
Design for at-least-once + idempotency. It is the most cost-effective and most robust pattern.
Idempotency — non-negotiable
Every consumer must handle duplicate delivery. Pattern:
async function handler(msg: Message) {
const idempotencyKey = msg.headers["x-idempotency-key"] ?? msg.id;
// Has this been processed?
const existing = await db.processedMessages.findUnique({ where: { idempotencyKey } });
if (existing) {
logger.info("duplicate.skipped", { idempotencyKey });
return existing.result;
}
// Process atomically with idempotency record.
return await db.transaction(async (tx) => {
const result = await doWork(msg, tx);
await tx.processedMessages.create({ data: { idempotencyKey, result } });
return result;
});
}The idempotency record + the side-effect commit in one transaction. Half-states get the producer to retry safely.
Where transactions cross stores (e.g. external API + local DB), apply the outbox pattern (see distributed-data-pattern.md).
Ordering guarantees
| Primitive | Ordering |
|---|---|
| Standard SQS | No order guaranteed |
| SQS FIFO | Per message-group ordered |
| Kafka / Kinesis | Per partition ordered |
| Redis Streams | Per stream ordered |
| RabbitMQ classic queues | Per queue ordered (but with consumer caveats) |
The shard / partition key chooses ordering scope. Common choice: tenant id (events per tenant ordered; cross-tenant unordered).
If the consumer needs global order, you have one partition; you have one consumer's throughput; you have a bottleneck. Avoid.
Dead-letter queue (DLQ)
Messages that fail repeatedly route to DLQ:
- After N retries (e.g. 5).
- Or after specific terminal errors (validation failure, missing entity).
DLQ is inspected — manually or via tooling. Each DLQ message is a bug:
- The message is malformed (producer bug).
- The handler has a regression (consumer bug).
- An upstream dependency is permanently down (deeper issue).
Never silently delete DLQ. Inspect; fix; replay.
Backpressure
When consumers are slow relative to producers, the queue grows.
Options:
- Auto-scale consumers: more workers; faster drain. Bounded by downstream capacity (DB, external APIs).
- Producer back-pressure: producers slow down on queue-depth signal. Hard to retrofit.
- Drop oldest (queue-depth cap): some workloads tolerate it (notifications). Most don't.
- Spillover: route to slower / cheaper storage at queue-depth threshold.
The wrong answer is to silently fall behind. Set alerts on queue depth + age of oldest message.
Schema evolution
Producers + consumers deploy independently. Their schemas must coexist across versions.
Rules:
- Add fields: new fields are optional with defaults. Old consumers ignore.
- Rename fields: requires deprecation cycle (per
api-versioning-pattern.md) — keep both names during transition. - Remove fields: requires guarantee no consumer reads them. Audit; deprecate; remove.
- Type changes: breaking; new event name preferred.
A schema registry (Confluent Schema Registry, Glue Schema Registry, in-house) enforces compatibility:
- Producer registers schema at publish time.
- Compatibility check: would old consumers parse this?
- Reject incompatible schemas at publish.
Without a registry, schema drift produces consumer crashes that are hard to diagnose.
Replay + reprocessing
For event streams (durable):
- Replay from offset: rewind a consumer; reprocess from N.
- Replay to dev: snapshot prod stream; replay locally.
- Backfill: a new consumer joins; processes the full history.
For queues (non-durable):
- Replay = manually re-publishing from logs / archive.
Tooling discipline: a replay command exists; it's safe; it's tested.
Idempotency across replays
Replaying produces duplicates. The same idempotency-key pattern handles it — as long as the idempotency-keys are stable across runs.
Counter-example: idempotencyKey = uuid() generated at processing time → every replay produces a "new" message. Stable keys are essential.
Event sourcing — the heavyweight pattern
Some systems persist only the event stream; current state is a projection.
Pros: full audit trail; replay rebuilds state; new projections retroactively serve new use cases.
Cons: every query goes through projections; schema evolution is hard; migrations are replays.
Adopt event sourcing deliberately, not by accident. It is a significant architecture commitment.
For most products: regular CRUD with an outbox of domain events is the right balance. Full event sourcing for systems where the event history IS the value (audit, financial systems, multi-step workflows).
CQRS — the companion pattern
Command Query Responsibility Segregation: write models and read models differ.
- Commands go to one shape (often the event stream).
- Queries hit one or more projections optimised for the query shape.
Useful when:
- Read and write loads are very different.
- Multiple read projections benefit from the same write events.
- Eventual consistency is acceptable for reads.
Overhead: two models to maintain; eventual consistency to communicate.
Event-driven UX
When the user triggers an action that goes async:
- Optimistic UI: show success immediately; reconcile on event-back.
- Status surface: explicit "running…" / "completed" / "failed".
- Idempotent retries: user clicks twice; second click finds the in-flight job.
Don't hide async-ness from the user; surface it.
Cost concerns
Event-streaming infrastructure costs:
- Per-message: SQS, SNS price per million.
- Per-throughput: Kafka, Kinesis charge for provisioned throughput.
- Per-storage: retention beyond 7 days = paid storage.
Tuning:
- Batch publishes where latency allows.
- Compress payloads (Snappy, gzip).
- Tune retention to actual replay window.
- Per-tenant tagging for attribution (per
../quality/cost-optimization-pattern.md).
Anti-patterns
- Synchronous-over-async: producer blocks waiting for consumer ack. Defeats decoupling.
- Event names that encode internal state:
UserRowVersion3UpdatedColumnX. Producers leak DB structure to consumers. - Fat events: 100 KB payloads. Consumers parse the whole world. → Small events + reference to canonical store.
- Anaemic events: just an id. Consumers re-fetch everything. → Include enough for common consumers.
- Topic per consumer: defeats decoupling. → One topic; many consumers.
- No DLQ: failing messages retry forever; queue grows; outage. → DLQ + alerts.
- No idempotency: duplicates produce double-charges, double-emails. → Idempotency-key everywhere.
Common operational failures
- Consumer lag spike → backpressure; investigate downstream.
- DLQ filling → consumer regression; inspect first message.
- Schema deploy breaks consumers → registry was bypassed; rollback; enforce registry.
- Replay duplicated work → idempotency-key not stable.
- Lost messages → at-most-once setting; switch to at-least-once + ack.
- Out-of-order in supposedly-ordered partition → consumer-side concurrency violates ordering; serialize.
Tooling stack (typical)
| Primitive | Tool |
|---|---|
| Managed queue | AWS SQS, GCP Tasks, Azure Service Bus |
| Self-hosted queue | RabbitMQ, BullMQ (Redis-backed) |
| Pub/sub | AWS SNS, GCP Pub/Sub, NATS |
| Event stream | Kafka, Confluent Cloud, Redpanda, AWS Kinesis, Azure Event Hubs |
| Schema registry | Confluent SR, AWS Glue SR, in-house JSON schema repo |
| Workflow engine | Temporal, AWS Step Functions, Inngest |
| Job scheduler | BullMQ, Sidekiq, Celery |
See also
distributed-data-pattern.md— outbox pattern feeds event streams.api-versioning-pattern.md— schema evolution rules.anti-overengineering.md— event sourcing is the canonical premature complexity.../quality/observability-pattern.md— queue depth, lag, DLQ size are key metrics.../quality/cost-optimization-pattern.md— event-stream cost.../security/audit-ledger-pattern.md— append-only ledger is a specialized event stream.