How to design async, decoupled communication via queues, pub/sub, and event streams — without losing events, double-processing, or stalling consumers.

Event Streaming Pattern

How to design async, decoupled communication via queues, pub/sub, and event streams — without losing events, double-processing, or stalling consumers.

TL;DR (human)

Three primitives: queues (work-distribution; one consumer per message), pub/sub topics (broadcast; many independent consumers), event streams (durable log; replayable; ordered per partition). Choose by use case. Discipline: idempotency on every consumer, dead-letter queue, schema evolution, replay tooling, backpressure handling. The hardest mistakes are subtle — re-delivery semantics, ordering guarantees, exactly-once myths.

For agents

Three primitives

Primitive	Semantics	Use case	Examples
Queue	At-least-once; one consumer per msg; FIFO or fair	Work distribution (jobs)	SQS, RabbitMQ, BullMQ
Pub/sub topic	At-least-once; fanout to N subscribers; usually no ordering	Notify many independent consumers	SNS, Cloud Pub/Sub, Redis pub/sub
Event stream	Durable log; ordered per partition; replayable	Event-sourced systems; analytics pipelines	Kafka, Kinesis, NATS JetStream, Redpanda

Mixing primitives is normal (queue + topic + stream). Picking the wrong one is costly.

Delivery semantics — the truth

Three theoretical options:

At-most-once: messages may be lost; never duplicated.
At-least-once: messages always delivered; may be duplicated.
Exactly-once: each message processed exactly once.

In practice:

Most production systems are at-least-once.
"Exactly-once" usually means at-least-once + idempotent consumer.
True end-to-end exactly-once exists in some systems (Kafka transactions + transactional sinks) but is expensive and narrow.

Design for at-least-once + idempotency. It is the most cost-effective and most robust pattern.

Idempotency — non-negotiable

Every consumer must handle duplicate delivery. Pattern:

async function handler(msg: Message) {
  const idempotencyKey = msg.headers["x-idempotency-key"] ?? msg.id;

  // Has this been processed?
  const existing = await db.processedMessages.findUnique({ where: { idempotencyKey } });
  if (existing) {
    logger.info("duplicate.skipped", { idempotencyKey });
    return existing.result;
  }

  // Process atomically with idempotency record.
  return await db.transaction(async (tx) => {
    const result = await doWork(msg, tx);
    await tx.processedMessages.create({ data: { idempotencyKey, result } });
    return result;
  });
}

The idempotency record + the side-effect commit in one transaction. Half-states get the producer to retry safely.

Where transactions cross stores (e.g. external API + local DB), apply the outbox pattern (see distributed-data-pattern.md).

Ordering guarantees

Primitive	Ordering
Standard SQS	No order guaranteed
SQS FIFO	Per message-group ordered
Kafka / Kinesis	Per partition ordered
Redis Streams	Per stream ordered
RabbitMQ classic queues	Per queue ordered (but with consumer caveats)

The shard / partition key chooses ordering scope. Common choice: tenant id (events per tenant ordered; cross-tenant unordered).

If the consumer needs global order, you have one partition; you have one consumer's throughput; you have a bottleneck. Avoid.

Dead-letter queue (DLQ)

Messages that fail repeatedly route to DLQ:

After N retries (e.g. 5).
Or after specific terminal errors (validation failure, missing entity).

DLQ is inspected — manually or via tooling. Each DLQ message is a bug:

The message is malformed (producer bug).
The handler has a regression (consumer bug).
An upstream dependency is permanently down (deeper issue).

Never silently delete DLQ. Inspect; fix; replay.

Backpressure

When consumers are slow relative to producers, the queue grows.

Options:

Auto-scale consumers: more workers; faster drain. Bounded by downstream capacity (DB, external APIs).
Producer back-pressure: producers slow down on queue-depth signal. Hard to retrofit.
Drop oldest (queue-depth cap): some workloads tolerate it (notifications). Most don't.
Spillover: route to slower / cheaper storage at queue-depth threshold.

The wrong answer is to silently fall behind. Set alerts on queue depth + age of oldest message.

Schema evolution

Producers + consumers deploy independently. Their schemas must coexist across versions.

Rules:

Add fields: new fields are optional with defaults. Old consumers ignore.
Rename fields: requires deprecation cycle (per api-versioning-pattern.md) — keep both names during transition.
Remove fields: requires guarantee no consumer reads them. Audit; deprecate; remove.
Type changes: breaking; new event name preferred.

A schema registry (Confluent Schema Registry, Glue Schema Registry, in-house) enforces compatibility:

Producer registers schema at publish time.
Compatibility check: would old consumers parse this?
Reject incompatible schemas at publish.

Without a registry, schema drift produces consumer crashes that are hard to diagnose.

Replay + reprocessing

For event streams (durable):

Replay from offset: rewind a consumer; reprocess from N.
Replay to dev: snapshot prod stream; replay locally.
Backfill: a new consumer joins; processes the full history.

For queues (non-durable):

Replay = manually re-publishing from logs / archive.

Tooling discipline: a replay command exists; it's safe; it's tested.

Idempotency across replays

Replaying produces duplicates. The same idempotency-key pattern handles it — as long as the idempotency-keys are stable across runs.

Counter-example: idempotencyKey = uuid() generated at processing time → every replay produces a "new" message. Stable keys are essential.

Event sourcing — the heavyweight pattern

Some systems persist only the event stream; current state is a projection.

Pros: full audit trail; replay rebuilds state; new projections retroactively serve new use cases.

Cons: every query goes through projections; schema evolution is hard; migrations are replays.

Adopt event sourcing deliberately, not by accident. It is a significant architecture commitment.

For most products: regular CRUD with an outbox of domain events is the right balance. Full event sourcing for systems where the event history IS the value (audit, financial systems, multi-step workflows).

CQRS — the companion pattern

Command Query Responsibility Segregation: write models and read models differ.

Commands go to one shape (often the event stream).
Queries hit one or more projections optimised for the query shape.

Useful when:

Read and write loads are very different.
Multiple read projections benefit from the same write events.
Eventual consistency is acceptable for reads.

Overhead: two models to maintain; eventual consistency to communicate.

Event-driven UX

When the user triggers an action that goes async:

Optimistic UI: show success immediately; reconcile on event-back.
Status surface: explicit "running…" / "completed" / "failed".
Idempotent retries: user clicks twice; second click finds the in-flight job.

Don't hide async-ness from the user; surface it.

Cost concerns

Event-streaming infrastructure costs:

Per-message: SQS, SNS price per million.
Per-throughput: Kafka, Kinesis charge for provisioned throughput.
Per-storage: retention beyond 7 days = paid storage.

Tuning:

Batch publishes where latency allows.
Compress payloads (Snappy, gzip).
Tune retention to actual replay window.
Per-tenant tagging for attribution (per ../quality/cost-optimization-pattern.md).

Anti-patterns

Synchronous-over-async: producer blocks waiting for consumer ack. Defeats decoupling.
Event names that encode internal state: UserRowVersion3UpdatedColumnX. Producers leak DB structure to consumers.
Fat events: 100 KB payloads. Consumers parse the whole world. → Small events + reference to canonical store.
Anaemic events: just an id. Consumers re-fetch everything. → Include enough for common consumers.
Topic per consumer: defeats decoupling. → One topic; many consumers.
No DLQ: failing messages retry forever; queue grows; outage. → DLQ + alerts.
No idempotency: duplicates produce double-charges, double-emails. → Idempotency-key everywhere.

Common operational failures

Consumer lag spike → backpressure; investigate downstream.
DLQ filling → consumer regression; inspect first message.
Schema deploy breaks consumers → registry was bypassed; rollback; enforce registry.
Replay duplicated work → idempotency-key not stable.
Lost messages → at-most-once setting; switch to at-least-once + ack.
Out-of-order in supposedly-ordered partition → consumer-side concurrency violates ordering; serialize.

Tooling stack (typical)

Primitive	Tool
Managed queue	AWS SQS, GCP Tasks, Azure Service Bus
Self-hosted queue	RabbitMQ, BullMQ (Redis-backed)
Pub/sub	AWS SNS, GCP Pub/Sub, NATS
Event stream	Kafka, Confluent Cloud, Redpanda, AWS Kinesis, Azure Event Hubs
Schema registry	Confluent SR, AWS Glue SR, in-house JSON schema repo
Workflow engine	Temporal, AWS Step Functions, Inngest
Job scheduler	BullMQ, Sidekiq, Celery

Event Streaming Pattern

Event Streaming Pattern

TL;DR (human)

For agents

Three primitives

Delivery semantics — the truth

Idempotency — non-negotiable

Ordering guarantees

Dead-letter queue (DLQ)

Backpressure

Schema evolution

Replay + reprocessing

Idempotency across replays

Event sourcing — the heavyweight pattern

CQRS — the companion pattern

Event-driven UX

Cost concerns

Anti-patterns

Common operational failures

Tooling stack (typical)

See also

On this page