Exponential Backoff with Jitter — Part 1: The Problem
19 March 2026
Every distributed system fails. Networks drop packets. APIs rate-limit you. Servers restart mid-request. The question isn’t whether your Node.js app will hit a failed HTTP call — it’s whether it will recover gracefully or crash your user’s experience.
Retry logic is the foundation of that recovery. But naive retries — hammering a struggling server the moment a request fails — can turn a minor blip into a full-scale outage.
The thundering herd problem
Imagine 500 clients all hitting the same API. The server hiccups at 2:00:00 AM and every in-flight request fails simultaneously. Without any strategy, all 500 clients retry at 2:00:01 AM, and now the server faces 500 simultaneous requests while it's already struggling.
This is the thundering herd problem. The AWS architecture team describes retries as selfish: a retrying client asserts that its request is important enough to spend the server’s resources again. Too many selfish clients at once, and you’ve turned a 5-second hiccup into a multi-minute cascading failure.
The solution involves three layered techniques.
Tool 1 — Timeouts
A timeout is a hard upper bound on how long you’ll wait for a response. Without one, a slow server can hold your request open indefinitely — consuming memory, file descriptors, and connection slots.
The tricky part is setting the right value. Too low and you’ll retry perfectly healthy (just slow) requests, adding unnecessary load. Too high and the timeout provides no real protection. A practical starting point: look at the p99 latency of the downstream service and add 20–30% headroom.
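As a minimal sketch (assuming Node 18+, where fetch and AbortController are globals; the 3-second budget is an illustrative value, not a prescription):

// Sketch: a hard upper bound on a request via AbortController (Node 18+)
const TIMEOUT_MS = 3000; // illustrative: downstream p99 plus ~25% headroom

async function fetchWithTimeout(url) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS);
  try {
    // fetch rejects with an AbortError once the controller aborts
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // never leave a stray timer running
  }
}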
Tool 2 — Exponential Backoff
Instead of retrying immediately, you wait longer after each failure. The delay grows exponentially:
Attempt 1 failure → wait 1,000ms
Attempt 2 failure → wait 2,000ms
Attempt 3 failure → wait 4,000ms
Attempt 4 failure → wait 8,000ms ← capped here
Attempt 5 failure → wait 8,000ms ← cap applies
The cap is critical. Without it, a 2× growth factor over 10 attempts produces a 512-second wait (over eight minutes), completely unacceptable for a user-facing application.
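As a sketch, the capped delay is a one-liner (the 1-second base and 8-second cap mirror the schedule above):

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 8000;

// attempt is 1-based: attempt 1 waits 1,000ms; attempt 4 and beyond hit the 8,000ms cap
function backoffDelay(attempt) {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempt - 1));
}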
Tool 3 — Jitter
Jitter adds randomness to the delay. Even with exponential backoff, if all clients experience the same failure at the same time, they all back off by the same interval and retry in a synchronized spike.
There are three common jitter strategies, plus the no-jitter baseline:
| Strategy | Formula | Tradeoff |
|---|---|---|
| No jitter | base × 2ⁿ | Fast but causes spikes |
| Full jitter | random(0, base × 2ⁿ) | Best load distribution |
| Equal jitter | (base × 2ⁿ / 2) + random(0, base × 2ⁿ / 2) | Prevents very short sleeps |
| Decorrelated jitter | random(base, lastDelay × 3) | Each delay builds on the last, so clients never re-synchronize |
AWS recommends full jitter for most use cases — it’s simple to implement and gives maximum spread. AWS’s simulation data (from Marc Brooker’s 2015 analysis) shows full jitter produces the fewest total retry calls under high contention.
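As a sketch, full jitter adds one line on top of the capped backoff (reusing BASE_DELAY_MS and MAX_DELAY_MS from above):

// Full jitter: a uniform random delay in [0, min(cap, base × 2ⁿ))
function fullJitterDelay(attempt) {
  const capped = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (attempt - 1));
  return Math.random() * capped;
}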
The mental model
Request fails
│
├── Network error or timeout? → retry with backoff + jitter
├── 4xx client error (not 408/429)? → throw immediately, do not retry
└── 5xx or 408/429? → retry with backoff + jitter
    │
    ├── Check Retry-After header first
    ├── Apply full jitter: random(0, min(cap, base × 2ⁿ))
    └── Increment attempt counter
Not every failure is worth retrying. A 400 Bad Request means your payload is broken — retrying the same request won’t fix it. A 503 Service Unavailable means the server is struggling — backing off and retrying makes sense.
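As a sketch of that decision in code (isRetryable and retryDelay are illustrative names, and fullJitterDelay comes from the sketch above):

// Only 408, 429, and 5xx are worth retrying, per the tree above
function isRetryable(response) {
  const { status } = response;
  return status === 408 || status === 429 || status >= 500;
}

// Honor Retry-After (seconds or an HTTP date) before falling back to full jitter
function retryDelay(response, attempt) {
  const header = response.headers.get('retry-after');
  if (header !== null) {
    const seconds = Number(header);
    if (!Number.isNaN(seconds)) return seconds * 1000;
    const date = Date.parse(header);
    if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  }
  return fullJitterDelay(attempt);
}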
What NOT to do
Retry without a cap:
// BAD: no cap. With a zero-based counter, attempt 10 waits 1,024,000ms (about 17 minutes)
const delay = BASE_DELAY_MS * Math.pow(2, attempt);
Retry 4xx errors blindly:
// BAD: A 400 Bad Request will never succeed
if (!response.ok) retry();
Ignore Retry-After headers:
// BAD: The API told you to wait 30 seconds. You waited 1s and got rate-limited again.
await sleep(BASE_DELAY_MS);
Retry at multiple layers:
// BAD: Both layers retry — failure load multiplies across the stack.
// Five attempts per layer across 3 layers = up to 5³ = 125 attempts for one request.
while (page <= totalPages) {
  try {
    await fetchWithRetry(url); // already retries internally
    page++;
  } catch {
    // outer retry: the same page is fetched again on top of the inner retries
  }
}
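The fix is to let exactly one layer own retries. A sketch, assuming the fetchWithRetry helper above retries internally and the loop simply propagates failure:

// GOOD: retries live in fetchWithRetry only; the loop fails fast
for (let page = 1; page <= totalPages; page++) {
  await fetchWithRetry(`${url}?page=${page}`);
}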
In Part 2 we’ll build a complete production-ready implementation in Node.js — timeouts with AbortController, full jitter backoff, Retry-After header support, structured logging, and paginated fetching with retries at one layer only.