The Transactional Outbox Pattern: Reliable Event Publishing Explained
A payments team I worked with last year shipped a regression that took three weeks to spot. The order service committed orders to Postgres, then published an OrderCreated event to RabbitMQ on a separate connection. About 0.4% of the time, the database commit succeeded and the broker publish silently failed. Downstream, the fulfilment service never heard about those orders, and customers waited for parcels that nobody had been asked to pick. By the time the reconciliation job flagged the gap, 4,217 orders had slipped through.
That class of bug is the dual-write problem, and it has a single textbook answer: the transactional outbox pattern. It is unglamorous, it adds a table, and it removes an entire category of correctness incidents from your roadmap.
The Dual-Write Problem in One Picture
The naive flow looks reasonable on paper:
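async function createOrder(input) {
  // Write 1: commits to the database.
  const order = await db.orders.insert(input);
  // Write 2: a separate system, outside the database transaction.
  await broker.publish('order.created', order);
  return order;
}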
Two writes, two systems, no coordination. Five things can go wrong:
1. The database commit fails. No publish. Consistent.
2. The publish fails. The database write is already committed. Inconsistent.
3. The publish succeeds but the ack is lost on the network. The code retries. Duplicate publish.
4. The process crashes between the database commit and the publish call. Inconsistent.
5. The broker accepts the message but never persists it. Rare with modern brokers but real with misconfigured ones.
Cases 2, 4 and 5 are all variants of the same problem: the database and the broker are separate transactional systems and there is no way to atomically commit to both. Distributed transactions (XA, two-phase commit) exist, are slow, and almost nobody uses them in modern microservice stacks for good reasons.
The outbox pattern removes the second write from the request path entirely. You only commit to one system, and the broker eventually catches up.
How the Outbox Pattern Works
The shape is simple:
- In the same database transaction as the business write, insert a row into an outbox table describing the event.
- A separate worker reads unsent outbox rows and publishes them to the broker.
- When the broker acks, the worker marks the row as sent (or deletes it).
- If the worker crashes, the next run picks up where it left off because the row is still unsent.
CREATE TABLE outbox (
  id BIGSERIAL PRIMARY KEY,
  aggregate_type TEXT NOT NULL,
  aggregate_id TEXT NOT NULL,
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  sent_at TIMESTAMPTZ
);
CREATE INDEX idx_outbox_unsent
  ON outbox (created_at)
  WHERE sent_at IS NULL;
The partial index is the small optimisation that keeps polling fast: only unsent rows are indexed, so the publisher’s query stays cheap even when the table grows to millions of rows.
The business write becomes:
await db.transaction(async (tx) => {
  const order = await tx.orders.insert(input);
  await tx.outbox.insert({
    aggregate_type: 'order',
    aggregate_id: order.id,
    event_type: 'order.created',
    payload: order,
  });
  return order;
});
One transaction, one durable system. If it commits, the event exists. If it rolls back, neither does. The broker publish is now somebody else’s problem.
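The commit-or-nothing behaviour can be seen in a toy model. This is a minimal sketch, not a database: createToyDb, its insert method, and the failBeforeCommit flag are hypothetical names invented for illustration, and the "transaction" simply stages writes in memory and applies them only if the callback returns without throwing.

```javascript
// Toy in-memory stand-in for a database (hypothetical, illustration only).
// Writes made through `tx` are staged and applied only if `work` returns
// without throwing, which mimics commit-or-rollback.
function createToyDb() {
  const tables = { orders: [], outbox: [] };
  return {
    tables,
    transaction(work) {
      const staged = { orders: [], outbox: [] };
      const tx = {
        insert(table, row) {
          staged[table].push(row);
          return row;
        },
      };
      const result = work(tx); // a throw here skips the "commit" below
      for (const name of Object.keys(staged)) {
        tables[name].push(...staged[name]);
      }
      return result;
    },
  };
}

// The business write and the event row are staged in the same transaction.
function createOrder(db, input, { failBeforeCommit = false } = {}) {
  return db.transaction((tx) => {
    const order = tx.insert('orders', { ...input });
    tx.insert('outbox', {
      aggregate_type: 'order',
      aggregate_id: String(order.id),
      event_type: 'order.created',
      payload: order,
    });
    if (failBeforeCommit) throw new Error('simulated failure before commit');
    return order;
  });
}

const toyDb = createToyDb();
createOrder(toyDb, { id: 1, total_cents: 4200 });
try {
  createOrder(toyDb, { id: 2, total_cents: 990 }, { failBeforeCommit: true });
} catch (err) {
  // Rolled back: neither the order nor its event was stored.
}
console.log(toyDb.tables.orders.length, toyDb.tables.outbox.length); // 1 1
```

The second call fails after both inserts were staged, and neither row survives. That is the property a real database transaction gives the outbox.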
The Publisher: Two Realistic Implementations
The outbox needs a process that reads unsent rows and ships them to the broker. There are two common approaches.
Polling publisher
The simplest version is a worker that runs every few hundred milliseconds:
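async function pollOutbox() {
  // FOR UPDATE SKIP LOCKED holds its row locks only until the transaction
  // ends, so the claim, the publish, and the update share one transaction.
  // tx.query is assumed to mirror db.query on the transaction's connection.
  await db.transaction(async (tx) => {
    const { rows } = await tx.query(`
      SELECT id, event_type, payload
      FROM outbox
      WHERE sent_at IS NULL
      ORDER BY created_at
      LIMIT 100
      FOR UPDATE SKIP LOCKED
    `);
    for (const row of rows) {
      await broker.publish(row.event_type, row.payload, {
        messageId: String(row.id),
      });
      await tx.query(
        `UPDATE outbox SET sent_at = NOW() WHERE id = $1`,
        [row.id],
      );
    }
  });
}
FOR UPDATE SKIP LOCKED lets multiple publisher instances share the load without stepping on each other; each picks up rows the others have not locked. The locks are released when the transaction ends, which is why the select, the publish, and the update run inside one transaction. If a publish throws partway through a batch, the transaction rolls back and the rows already published are sent again on the next run, so consumers must tolerate duplicates. Postgres handles this cleanly. The pattern works on MySQL 8+ as well.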
The messageId is the outbox row id, which makes the consumer’s deduplication job trivial: it stores seen ids and ignores repeats. This is the pattern covered in how to design idempotent API endpoints, applied to the consumer side.
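A minimal sketch of that consumer-side deduplication, assuming messages arrive as objects with messageId and payload fields. createIdempotentConsumer is a hypothetical helper, and the in-memory Set stands in for what would normally be a durable table with a unique constraint on the message id.

```javascript
// Hypothetical consumer wrapper that ignores messages it has already handled.
// The Set is an in-memory stand-in; a real consumer would persist seen ids,
// ideally in the same transaction as the handler's own side effects.
function createIdempotentConsumer(handle) {
  const seen = new Set();
  return function onMessage(message) {
    if (seen.has(message.messageId)) {
      return false; // duplicate delivery: skip
    }
    handle(message.payload);
    seen.add(message.messageId); // record only after the handler succeeded
    return true;
  };
}

const shipped = [];
const onMessage = createIdempotentConsumer((order) => shipped.push(order.id));
onMessage({ messageId: '17', payload: { id: 17 } }); // processed
onMessage({ messageId: '17', payload: { id: 17 } }); // redelivery, ignored
```

Recording the id only after the handler returns means a handler that throws is retried rather than silently skipped.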
Tradeoffs of polling:
- Latency floor equal to the poll interval. 200 ms is a sensible default; lower if you need it, but watch CPU.
- Database load proportional to the number of publisher instances. Usually trivial.
- Simple to operate. No extra infrastructure beyond what you already have.
Log-based publisher (change data capture)
The alternative is to read the database’s write-ahead log directly. Debezium’s outbox event router reads the outbox table from the WAL via Postgres logical replication and publishes to Kafka with millisecond latency.
This is the right choice when:
- You already operate Kafka and Debezium.
- You need sub-second event-to-broker latency.
- You publish a high volume of events and want to avoid the polling overhead.
It is the wrong choice when:
- You do not already run Debezium. The operational cost is not trivial.
- Your broker is not Kafka (or one of the few brokers Debezium supports).
- Your event volume is low enough that polling is invisible in the metrics.
For most teams below a few hundred events per second, polling wins on simplicity. Move to CDC when the polling overhead actually shows up in monitoring, not before. This is the same logic as the case for boring technology: pick the option you can operate at 3am.
Comparing the Two Publishers
| Concern | Polling publisher | Log-based publisher |
|---|---|---|
| Latency | 100 to 500 ms typical | Tens of ms |
| Infrastructure | Just your DB and broker | Plus Debezium, Connect, Kafka |
| Database load | Light, scales with polling rate | Negligible (reads WAL) |
| Broker support | Anything with a client library | Mostly Kafka, some others |
| Operational complexity | Low | Medium to high |
| Failure recovery | Restart the worker | Restart Connect, monitor lag |
| Best fit | Most services, mixed brokers | High-throughput Kafka shops |
What Can Still Go Wrong
The outbox is not magic. Three failure modes survive and need explicit handling.
Duplicate publishes
The worker can crash after the broker acks but before the sent_at update commits. The next run will publish the same row again. Every consumer must be idempotent. Use the outbox row id as the deduplication key. Without idempotent consumers, the outbox pattern silently turns a “missing event” bug into a “duplicate event” bug, which is rarely an improvement. For the broker-agnostic version, see the Patterns of Distributed Systems catalogue on Martin Fowler’s site.
Out-of-order delivery
If publisher latency varies, or if several publisher instances claim batches concurrently, a later event can be published before an earlier one. Per-aggregate ordering is the easier guarantee to recover: partition the broker topic by aggregate_id, and most brokers will preserve the order in which messages arrive within a partition. Strict global ordering across all events in a topic is harder and usually not worth the cost. Decide which guarantee you actually need.
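To make the partitioning idea concrete, here is a small sketch of a deterministic key-to-partition mapping. Many brokers derive the partition from a message key for you, so in practice you would usually pass aggregate_id as the key and not hash it yourself. partitionFor and its hash function are illustrative assumptions, not any broker's actual partitioner.

```javascript
// Illustrative only: map an aggregate id to one of `partitionCount`
// partitions so that every event for the same aggregate lands on the
// same partition.
function partitionFor(aggregateId, partitionCount) {
  let hash = 0;
  for (const ch of String(aggregateId)) {
    // Simple multiplicative string hash, kept as an unsigned 32-bit value.
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0;
  }
  return hash % partitionCount;
}

// Both events for order 42 map to the same partition, so a broker that
// orders messages within a partition keeps them in publish order.
const first = partitionFor('42', 8);
const second = partitionFor('42', 8);
console.log(first === second); // true
```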
Unbounded outbox growth
Every row that publishes successfully is still sitting in the table. The simplest mitigation is a scheduled job that deletes rows with sent_at older than 7 days. Keep enough history to debug a misbehaving consumer, but do not hoard. If you need permanent event history, that is the job of an event store, not the outbox.
DELETE FROM outbox
WHERE sent_at IS NOT NULL
  AND sent_at < NOW() - INTERVAL '7 days';
Schedule this with the same job runner you would use for any periodic maintenance work. The mechanics are covered in the developer’s guide to background jobs and task queues.
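If the table is large, one possible refinement is to delete in batches so that no single statement runs for long. The sketch below assumes a query(sql, params) function that resolves to an object with a rowCount field, as node-postgres does. purgeSentRows and its option names are hypothetical.

```javascript
// Hypothetical batched cleanup. `query` is injected so the function can be
// used with any client exposing query(sql, params) -> Promise<{ rowCount }>.
async function purgeSentRows(query, { retentionDays = 7, batchSize = 1000 } = {}) {
  let total = 0;
  for (;;) {
    // Postgres DELETE has no LIMIT clause, so select a bounded set of ids.
    const result = await query(
      `DELETE FROM outbox
       WHERE id IN (
         SELECT id FROM outbox
         WHERE sent_at IS NOT NULL
           AND sent_at < NOW() - make_interval(days => $1::int)
         LIMIT $2
       )`,
      [retentionDays, batchSize],
    );
    total += result.rowCount;
    // A short batch means nothing older than the cutoff is left.
    if (result.rowCount < batchSize) return total;
  }
}
```

The function returns the total number of rows removed, which is a convenient value to log for the cleanup-health signal.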
How It Fits with Surrounding Patterns
The outbox is rarely the only pattern in play. It composes with:
- Idempotent consumers. Required. The outbox guarantees at-least-once; consumers handle the duplicates.
- Sagas. A multi-service workflow where each step publishes events. The outbox makes each saga step durable.
- Event sourcing. A different pattern where state is rebuilt from events. The outbox can produce the event stream that event sourcing consumes.
- CQRS. Read models built from events. The outbox is one of the cleanest ways to feed those read models.
- Retries and circuit breakers. The publisher’s retry loop is the same pattern as any resilient consumer. See building resilient APIs with retry and circuit breaker patterns for the failure modes.
The pattern also reads cleanly against the broader landscape of Chris Richardson’s microservice patterns, where it sits next to saga, CDC, and the inbox pattern (the consumer-side counterpart).
For the broker side of the choice, message queues for developers: RabbitMQ vs Kafka vs SQS covers the tradeoffs that decide whether you reach for CDC or polling. And for the architectural style this pattern serves, understanding event-driven architecture is the wider context.
A Minimal End-to-End Example
A complete order creation flow using the outbox:
// HTTP handler
app.post('/api/orders', async (req, res) => {
  const order = await db.transaction(async (tx) => {
    const row = await tx.orders.insert({
      customer_id: req.body.customerId,
      total_cents: req.body.totalCents,
      status: 'pending',
    });
    await tx.outbox.insert({
      aggregate_type: 'order',
      aggregate_id: row.id,
      event_type: 'order.created',
      payload: {
        id: row.id,
        customer_id: row.customer_id,
        total_cents: row.total_cents,
      },
    });
    return row;
  });
  res.status(201).json(order);
});
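// Publisher worker, runs continuously
async function runPublisher() {
  for (;;) {
    let progressed = false;
    try {
      // Claim, publish, and mark sent inside one transaction: the row locks
      // taken by FOR UPDATE SKIP LOCKED are released when it ends.
      // tx.query is assumed to mirror db.query on the transaction's connection.
      progressed = await db.transaction(async (tx) => {
        const claimed = await tx.query(`
          SELECT id, event_type, payload
          FROM outbox
          WHERE sent_at IS NULL
          ORDER BY created_at
          LIMIT 200
          FOR UPDATE SKIP LOCKED
        `);
        for (const row of claimed.rows) {
          try {
            await broker.publish(row.event_type, row.payload, {
              messageId: String(row.id),
            });
          } catch (err) {
            logger.error({ err, id: row.id }, 'outbox publish failed');
            return false; // commit the rows already marked, then back off
          }
          await tx.query(
            `UPDATE outbox SET sent_at = NOW() WHERE id = $1`,
            [row.id],
          );
        }
        return claimed.rows.length > 0;
      });
    } catch (err) {
      // Database error: the transaction rolled back, so the batch stays unsent.
      logger.error({ err }, 'outbox batch failed');
    }
    // Idle or failing: pause so the loop does not spin against a down broker.
    if (!progressed) await sleep(200);
  }
}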
A few things worth pointing out:
- The HTTP handler does no broker work. The response returns after the database commit, so latency is unaffected by broker health.
- The publisher claims rows with FOR UPDATE SKIP LOCKED, so running multiple instances scales horizontally without coordination.
- Failed publishes leave sent_at null; the row is picked up again next loop.
- Logged errors should drive alerting. A growing backlog of unsent rows means the broker is degraded, the publisher is stuck, or both.
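When publishes fail repeatedly, retrying at a fixed interval keeps hitting a broker that is already struggling. One common refinement is capped exponential backoff between attempts. The helper below is a sketch under assumed defaults (200 ms base, 30 s cap). backoffDelayMs is a hypothetical name, and jitter is left out to keep the example deterministic.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling each time and capped at `maxMs`.
function backoffDelayMs(attempt, { baseMs = 200, maxMs = 30000 } = {}) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Schedule for consecutive failures with the defaults:
// attempt 0 -> 200 ms, 1 -> 400 ms, 2 -> 800 ms, and so on up to the cap.
console.log([0, 1, 2, 10].map((n) => backoffDelayMs(n))); // [ 200, 400, 800, 30000 ]
```

A publisher loop would reset the attempt counter to zero after any successful publish, so a recovered broker is polled at the normal rate again.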
What to Monitor
Three signals matter:
- Outbox backlog size. Count of rows where sent_at IS NULL. A growing backlog is the leading indicator of a stuck publisher or a broker outage. Alert on a threshold that reflects normal throughput. This is the kind of indicator that earns a place in your reliability work, covered in service level objectives: a developer’s guide to SLOs.
- Outbox publish lag. The age of the oldest unsent row. A useful proxy for end-to-end event latency.
- Cleanup job health. The deletion job’s last successful run and the row count older than the retention window.
A useful diagnostic query:
SELECT
  COUNT(*) FILTER (WHERE sent_at IS NULL) AS unsent,
  COUNT(*) FILTER (WHERE sent_at IS NOT NULL) AS sent,
  EXTRACT(EPOCH FROM NOW() - MIN(created_at) FILTER (WHERE sent_at IS NULL)) AS oldest_unsent_seconds
FROM outbox;
Run it on a cadence; surface the result on a dashboard.
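As a sketch of what "on a cadence" might feed into, the function below turns one row of that query's result into a pass or fail signal. The thresholds are placeholder assumptions to be tuned against your normal throughput, and evaluateOutboxHealth is a hypothetical name. Values are converted with Number() because some drivers (node-postgres among them) return bigint counts as strings.

```javascript
// Hypothetical health check over one row of the diagnostic query.
// `stats` has the columns unsent, sent, oldest_unsent_seconds.
function evaluateOutboxHealth(stats, { maxUnsent = 1000, maxLagSeconds = 60 } = {}) {
  const alerts = [];
  const unsent = Number(stats.unsent);
  // oldest_unsent_seconds is NULL when there are no unsent rows.
  const lagSeconds =
    stats.oldest_unsent_seconds === null || stats.oldest_unsent_seconds === undefined
      ? 0
      : Number(stats.oldest_unsent_seconds);
  if (unsent > maxUnsent) {
    alerts.push(`backlog of ${unsent} rows exceeds ${maxUnsent}`);
  }
  if (lagSeconds > maxLagSeconds) {
    alerts.push(`oldest unsent row is ${lagSeconds}s old, limit ${maxLagSeconds}s`);
  }
  return { healthy: alerts.length === 0, alerts };
}

// A quiet outbox: small backlog, oldest row under a second old.
console.log(evaluateOutboxHealth({ unsent: '12', sent: '9000', oldest_unsent_seconds: '0.8' }).healthy); // true
```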
When Not to Use the Outbox
The pattern has real cost: a table, a publisher process, monitoring, a cleanup job, and a discipline of writing events inside transactions. Skip it when:
- You publish no events. A service with no external notifications has no dual-write problem.
- The downstream effect of a lost event is genuinely acceptable. A best-effort analytics event qualifies; a billing event does not.
- Your write path already lives inside the broker. If you are using Kafka as your source of truth and your service produces directly to a topic with a producer that commits transactionally with your local state (Kafka Streams, exactly-once semantics), you have a different topology with its own constraints.
For everything in the middle, where you have a database, a broker, and downstream consumers that need every event, the outbox is the smallest amount of complexity that gives you correctness.
The Short Version
- The dual-write problem is real and bites quietly. Reconciliation jobs eventually find the gaps, but late.
- The outbox pattern collapses two writes into one transaction by storing the event in your own database.
- A polling publisher is the right starting point for almost every team; log-based CDC is the right upgrade when polling becomes a measurable cost.
- Consumers must be idempotent regardless. The outbox guarantees at-least-once, not exactly-once.
- Monitor the backlog and the oldest unsent row. Both are leading indicators of broker or publisher trouble.
The outbox does not eliminate distributed systems problems. It moves them to the parts of the stack that are designed to handle them: durable storage, idempotent consumers, and operational visibility. That is most of what reliable event publishing actually is.
Frequently asked questions
What is the transactional outbox pattern?
The transactional outbox pattern is a way of publishing events to a message broker without risking inconsistency between your database and the broker. Instead of writing to the database and then publishing to a broker in two separate steps, you write the event to an outbox table inside the same database transaction as the business change. A separate process then reads from the outbox and publishes to the broker, retrying until the publish succeeds. The single transaction guarantees the event exists if and only if the business change committed.
What problem does the outbox pattern actually solve?
It solves the dual-write problem: when a service needs to update its own database and notify other services through a message broker, doing both as independent operations means either can succeed while the other fails. The classic failure mode is committing an order to Postgres, then crashing before the OrderCreated event reaches RabbitMQ. The next service never hears about the order. The outbox makes the broker publish a consequence of the database write rather than a separate concern, so the two cannot drift apart.
How does the outbox pattern differ from the change data capture (CDC) pattern?
An outbox is the table where you write events deliberately as part of business logic. CDC is the mechanism that reads those rows and ships them downstream. Debezium reading from the Postgres write-ahead log is a CDC implementation; it can be pointed at your outbox table to publish events to Kafka with no extra polling. Some teams skip the outbox table and have CDC stream every business table directly, but that exposes raw schema changes to consumers, which is rarely what you want.
Do outbox events guarantee exactly-once delivery?
No. The outbox guarantees at-least-once delivery to the broker, because the publisher will retry until the broker acks. Consumers will still occasionally see duplicates because of network issues, broker re-deliveries, or publisher crashes after the broker accepted the message but before the outbox row was marked sent. Consumers must be idempotent. Pair the outbox with idempotent receivers and you get the practical equivalent of exactly-once processing without the impossible guarantee.
Is the outbox pattern overkill for a small service?
It depends on whether you publish events at all. A single service with no message broker has no dual-write problem and needs no outbox. A small service that fires an SNS notification on user signup probably can live with the occasional missed message in early days, especially if the consumer is non-critical (a marketing email). The moment a downstream system materially depends on the event, for example billing, inventory, or fraud detection, the outbox is the cheapest correctness insurance you can buy.