Retry Policies, Circuit Breakers & Dead Letter Queues

The Failure Taxonomy

Production systems fail in predictable ways. Each failure type needs a specific defense:

Failure     Example                             Defense
Transient   Network timeout, 503 error          Retry with exponential backoff
Sustained   Service down for minutes            Circuit breaker (fail fast)
Permanent   Malformed event, missing data       Dead letter queue (manual fix)
Duplicate   Webhook retry delivers same event   Idempotency guard

Using the wrong defense makes things worse. Retrying a permanent failure wastes resources. Circuit-breaking a transient error causes unnecessary downtime. The key is matching the defense to the failure type.

Retry with Exponential Backoff

When a transient error occurs (timeout, 503, rate limit), retrying usually works — the network reconnects, the server recovers, the rate limit window resets.

But retrying immediately is dangerous. If a service is struggling under load, hammering it with retries makes it worse. Exponential backoff increases the wait between retries:

Attempt 1: wait ~1 second
Attempt 2: wait ~2 seconds
Attempt 3: wait ~4 seconds
Attempt 4: wait ~8 seconds (capped at 30s)

Jitter adds randomness (50-100% of the calculated delay). Without jitter, 100 workflows that fail at the same time all retry at the same time — a "thundering herd" that overwhelms the recovering service. Jitter spreads the retries across time.

const baseDelay = 1_000;  // 1 second
const maxDelay = 30_000;  // cap at 30 seconds

function calculateDelay(attempt: number): number {
  // attempt is 1-based: attempt 1 -> ~1s, attempt 2 -> ~2s, attempt 3 -> ~4s, ...
  const exponential = baseDelay * Math.pow(2, attempt - 1);
  const capped = Math.min(exponential, maxDelay);
  return capped * (0.5 + Math.random() * 0.5); // jitter: 50-100% of the capped delay
}

Not every error is retryable. A 400 Bad Request will never succeed no matter how many times you retry. The retry policy classifies errors: network errors (ETIMEDOUT, ECONNRESET) and server errors (503, 429) are retryable. Client errors (400, 404) are not.
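
Putting the backoff calculation and the classification together, a retry wrapper might look like the sketch below. The isRetryable helper, the exact codes it checks, and the withRetry wrapper are illustrative assumptions rather than a prescribed API:

// Sketch of a retry loop using calculateDelay from above.
// isRetryable and the specific codes checked are assumptions for illustration.
const RETRYABLE_CODES = new Set(["ETIMEDOUT", "ECONNRESET"]);
const RETRYABLE_STATUSES = new Set([429, 503]);

function isRetryable(err: { code?: string; status?: number }): boolean {
  if (err.code && RETRYABLE_CODES.has(err.code)) return true;        // network errors
  if (err.status && RETRYABLE_STATUSES.has(err.status)) return true; // server errors
  return false; // client errors (400, 404) will never succeed on retry
}

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, calculateDelay(attempt)));
    }
  }
}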

Circuit Breaker

Retries handle transient failures. But what if the service is truly down — not a momentary blip, but a sustained outage?

Without a circuit breaker, every workflow execution waits through 3 retries × a 30-second timeout = 90 seconds before failing. With 100 concurrent workflows, you have 100 connections all waiting for a dead service.

The circuit breaker is a state machine:

Closed (normal): Requests pass through. Failures increment a counter.

Open (rejecting): After 5 consecutive failures, the circuit opens. All requests are immediately rejected in <1ms — no waiting for timeouts. This prevents resource exhaustion.

Half-open (probing): After a cooldown (30 seconds), the circuit allows ONE probe request. If it succeeds, the circuit closes (service is back). If it fails, the circuit stays open for another cooldown period.

Each downstream service gets its own circuit breaker. If the AI API goes down, it doesn't affect the CRM API or the email service.
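
A minimal version of that state machine fits in a small class. This is a sketch under assumptions: the class shape and names are invented here, while the thresholds (5 consecutive failures, 30-second cooldown) follow the numbers above.

// One CircuitBreaker instance per downstream service.
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,  // consecutive failures before opening
    private cooldownMs = 30_000,   // how long to stay open
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open"); // reject in <1ms, no timeout wait
      }
      this.state = "half-open"; // cooldown elapsed: allow a probe through
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed"; // probe (or normal call) succeeded
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open"; // fail fast until the next cooldown expires
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

One simplification to note: in the half-open state this sketch lets any concurrent request act as a probe; a production implementation would serialize probes so only one runs per cooldown.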

Dead Letter Queue

Some events will never succeed. The payload is malformed. The customer account was deleted. The required API endpoint was decommissioned. After exhausting all retries, these events land in the dead letter queue (DLQ).

The DLQ is the safety net for data integrity. Without it, failed events disappear silently — a support ticket never gets processed, a payment never gets confirmed, and nobody knows until a customer complains.

The DLQ stores:

  • The original event (complete payload for replay)
  • The error message and retry count
  • The timestamp and workflow that failed

Operators can:

  • Inspect — See all unresolved entries, filter by error type
  • Fix and replay — Correct the underlying issue, then replay the event through the workflow
  • Resolve — Mark entries as handled after manual intervention

DLQ depth is a critical operational metric. If entries are accumulating faster than they're being resolved, something is systemically wrong.
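
As a sketch, a DLQ entry can be modeled as a small record. The field names below are assumptions that mirror the stored data described above:

// Hypothetical shape of a dead letter queue entry.
interface DeadLetterEntry {
  id: string;
  event: unknown;        // complete original payload, kept intact for replay
  workflowId: string;    // the workflow that failed
  error: string;         // last error message
  retryCount: number;    // attempts exhausted before landing here
  failedAt: Date;
  status: "unresolved" | "replayed" | "resolved";
}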

Idempotency

Webhook providers retry delivery. Stripe re-sends if it doesn't receive a 200 within 5 seconds. Zendesk retries up to 3 times. GitHub retries on network errors.

Without idempotency, the same support ticket gets processed twice. Two draft responses are generated. Two Slack notifications fire. The customer gets duplicate emails.

The idempotency guard is simple: before processing an event, check if its ID has been seen before. If yes, skip it and return the previous result. If no, process it and record the ID.

if (idempotencyStore.isDuplicate(event.id)) {
  return idempotencyStore.getResult(event.id); // already processed: return the recorded result
}
// ... process event ...
idempotencyStore.record(event.id, executionId, "success");

The store uses FIFO eviction to prevent unbounded memory growth — old event IDs are purged when capacity is reached. In production, use Redis with TTL expiration.
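
A minimal in-memory version of that store might look like the following sketch. The class shape and method names are assumptions; only the FIFO eviction behavior comes from the text:

// In-memory idempotency store with FIFO eviction.
class IdempotencyStore {
  // Map preserves insertion order, so the first key is always the oldest.
  private results = new Map<string, { executionId: string; status: string }>();

  constructor(private capacity = 10_000) {}

  isDuplicate(eventId: string): boolean {
    return this.results.has(eventId);
  }

  getResult(eventId: string) {
    return this.results.get(eventId);
  }

  record(eventId: string, executionId: string, status: string): void {
    if (this.results.size >= this.capacity) {
      const oldest = this.results.keys().next().value; // FIFO eviction
      if (oldest !== undefined) this.results.delete(oldest);
    }
    this.results.set(eventId, { executionId, status });
  }
}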

What's Next

In Module 5, you'll build the dashboard that makes all of this visible — execution timelines, node outputs, error logs, and live event triggering. The reliability layer ensures the system works; the dashboard ensures you know it's working.

This is chapter 4 of AI Workflow Automation.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
