Retries Are a Smell

Why I stopped trusting retry loops in production systems.

There's something comforting about a retry loop.

Wrap a block of code in try/except, sleep a little, try again. Feels responsible. Feels resilient.
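The comforting version looks something like this — a minimal Python sketch, with a hypothetical `fetch` standing in for any flaky call:

```python
import time

# The "comforting" pattern: catch everything, sleep, try again.
# `fetch` is a placeholder for whatever flaky operation you're wrapping.
def fetch_with_retries(fetch, attempts=3, delay=1.0):
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as err:  # swallows *every* failure, transient or not
            last_error = err
            time.sleep(delay)
    raise last_error
```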

It isn't.

Retries are often a smell.

Not always. But often.

They're a signal that something deeper isn't designed correctly - and instead of fixing the underlying structure, we're asking the system to "just try harder."

That's not resilience. That's optimism.

And optimism is not a strategy in production.

The Illusion of Safety

A retry makes you feel safe because it handles transient failure.

Network glitch? Retry.

Database timeout? Retry.

Temporary lock? Retry.

Sure.

But here's what most retry logic actually does:

  • It hides race conditions.
  • It amplifies load during partial outages.
  • It makes nondeterministic systems look stable.
  • It papers over broken invariants.

The moment you rely on retries as your primary failure strategy, you've admitted the system isn't deterministic.

You're not solving the problem. You're gambling that the next attempt will be luckier.

Retries Don't Fix Broken State

If your operation isn't idempotent, retries are dangerous.

If your state transitions aren't atomic, retries are dangerous.

If your system doesn't know what "correct" looks like, retries are dangerous.

Because now you're not just retrying a failed action.

You're potentially duplicating side effects.

Sending two emails. Charging twice. Uploading the same file twice. Marking something complete that never finished.

Retries multiply uncertainty unless the underlying system is built around invariants.

What Actually Makes Systems Survive

In production systems - especially long-running background workers - what matters isn't how often you retry.

What matters is this:

  • Can the system restart at any moment without corrupting state?
  • Can an operation run twice without breaking correctness?
  • Is the source of truth unambiguous?
  • Can you derive current state from durable data?

If those are true, retries become boring.

If those are false, retries become dangerous.

The goal isn't "keep trying."

Make it safe to try again.
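One way to make "try again" safe is to let durable state, not in-process memory, decide what still needs doing. A sketch of a restart-safe worker loop, with a dict standing in for the database:

```python
# Sketch: a restart-safe worker. Durable state (here a dict standing in
# for a database) is the only source of truth; the process can crash and
# restart at any point and simply re-derive its work queue.
def pending_jobs(store):
    return [job_id for job_id, status in store.items() if status != "done"]

def run_worker(store, handle):
    for job_id in pending_jobs(store):
        handle(job_id)          # must itself be safe to run twice
        store[job_id] = "done"  # completion is explicit and durable
```

A crash between `handle` and the status write just means the job runs again on restart — which is fine, because we required `handle` to be idempotent.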

The Amplification Problem

Retries amplify failure.

If a dependency is partially down and you have many workers retrying aggressively, you can turn a hiccup into a self-inflicted outage.
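The arithmetic is unforgiving. A back-of-the-envelope sketch, with made-up numbers:

```python
# Back-of-the-envelope: how naive retries multiply load on a degraded
# dependency. All numbers are illustrative.
workers = 200   # concurrent workers
attempts = 5    # tries per operation (1 original + 4 retries)

# During the outage every attempt fails, so each worker exhausts all
# attempts and the dependency sees:
requests_during_outage = workers * attempts
baseline = workers * 1  # healthy case: one attempt each

print(requests_during_outage / baseline)  # → 5.0x load, exactly when it hurts most
```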

Your system starts fighting the thing it depends on.

A small timeout turns into a cascade. Logs explode. CPU spikes. Operators panic.

All because "retry on failure" was treated as a safety blanket instead of a last resort.

Retries Are a Tool - Not a Strategy

Retries are correct when:

  • The operation is idempotent.
  • The state machine is explicit.
  • The failure is genuinely transient.
  • Attempts are bounded.
  • You can prove it won't corrupt data.
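When those conditions hold, the disciplined version bounds its attempts and spreads them out so a fleet of workers doesn't hammer a struggling dependency in lockstep. A sketch of exponential backoff with full jitter — the parameter names are illustrative:

```python
import random

# Sketch: bounded retry delays with exponential backoff and "full
# jitter". Each delay is drawn uniformly from [0, min(cap, base * 2^n)],
# so concurrent workers naturally desynchronize. Assumes the wrapped
# operation is already idempotent.
def backoff_delays(attempts=5, base=0.5, cap=30.0):
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))
```

The bound (`attempts`) matters as much as the jitter: an unbounded retry loop is just the amplification problem on a timer.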

Retries should exist inside a design that is already safe.

They should not be compensating for one that isn't.

The Real Shift

Stop asking, "How many times should we retry?"

Start asking, "Why can this fail in a way that leaves the system confused?"

If every operation is safe to re-run, completion is explicit, and truth is durable, retries become boring implementation details.

Boring systems survive.

The Smell Test

  • What invariant am I avoiding defining?
  • What state transition is ambiguous?
  • What happens if this runs twice?
  • What happens if the process crashes halfway through?

If you can't answer those clearly, the retry isn't your fix.

It's your warning light.
